  1. 12 Apr, 2004 1 commit
    • [PATCH] speed up ext2 fsync() and fdatasync() · 7176142a
      Andrew Morton authored
      ext2_sync_file() forgets to clear the inode's dirty bits, so we write the
      inode on every fsync(), even if it hasn't changed.
      
      Fix that up via the new sync_file() API which correctly manages the inode
      state bits and the superblock inode lists.
      
      When performing file overwrite on IDE with and without writeback caching
      enabled this patch approximately doubles fsync() speed, bringing it into line
      with O_SYNC writes.
      
      Also, fix up the return value handling in ext2_sync_file().
      
      Credit due to Jeffrey Siegal <jbs@quiotix.com> who noticed the performance
      discrepancy and wrote a test app.
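
      A minimal userspace sketch of the dirty-bit idea, not the kernel code:
      toy_inode, toy_write_inode() and toy_fsync() are hypothetical names used
      only to show why clearing the dirty state after the write matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an in-memory inode; not the kernel's struct inode. */
struct toy_inode {
	bool dirty;          /* set whenever inode metadata changes */
	unsigned int writes; /* counts simulated on-disk inode writes */
};

/* Simulate writing the inode to disk. */
void toy_write_inode(struct toy_inode *inode)
{
	inode->writes++;
}

/*
 * Write the inode only if it is dirty, then clear the dirty bit.
 * Leaving out the clear is the bug described above: every later
 * fsync() would rewrite the inode even though nothing changed.
 */
int toy_fsync(struct toy_inode *inode)
{
	assert(inode);
	if (!inode->dirty)
		return 0;
	toy_write_inode(inode);
	inode->dirty = false;
	return 0;
}
```

      Calling toy_fsync() twice in a row on an unchanged inode performs one
      write, which is the behaviour the patch is after.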
  2. 19 Jan, 2004 1 commit
    • [PATCH] bdev: switch to f_mapping · 32d66678
      Andrew Morton authored
      From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
      
      A lot of places used to use ->f_dentry->d_inode->i_mapping all over the
      place.  Replaced with use of ->f_mapping.  For now - just the places where we
      literally could do search-and-replace.
  3. 01 Oct, 2003 1 commit
    • [PATCH] dev_t forward compatibility fix · 1885b3f1
      Andrew Morton authored
      From: Andries.Brouwer@cwi.nl
      
      ext2 used a 32-bit field for dev_t, with possibly undefined storage
      following; thus, no action was required to go to 32-bit dev_t, but going to
      64-bit dev_t required some subtlety: 0 was written in the first word and
      the 64 bits in the following two.  Al truncated my 64-bit stuff to 32 bits
      but did not understand why there was this split, and wrote 0 followed by a
      single word.  We should at least zero the word following to have
      well-defined storage later.
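
      The layout can be sketched in userspace as follows.  This is a toy model
      under stated assumptions: the 8:8 and 12:20 encodings are modelled on the
      kernel's old and new dev_t formats, and store_dev()/load_dev() are
      hypothetical helpers, not ext2 functions.

```c
#include <assert.h>
#include <stdint.h>

/* Old 16-bit format: 8-bit major, 8-bit minor (assumed layout). */
uint32_t encode_old_dev(unsigned int major, unsigned int minor)
{
	return (major << 8) | minor;
}

/* New 32-bit format: 12-bit major, 20-bit minor split around it (assumed layout). */
uint32_t encode_new_dev(unsigned int major, unsigned int minor)
{
	return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Store a device number in the first three words of a toy i_data[]. */
void store_dev(uint32_t i_data[3], unsigned int major, unsigned int minor)
{
	assert(major < 4096 && minor < (1u << 20));
	if (major < 256 && minor < 256) {
		i_data[0] = encode_old_dev(major, minor);
		i_data[1] = 0;
	} else {
		i_data[0] = 0;                         /* marks "large dev_t" */
		i_data[1] = encode_new_dev(major, minor);
		i_data[2] = 0;                         /* keep the next word well-defined */
	}
}

/* Read the device number back: a non-zero first word means the old format. */
void load_dev(const uint32_t i_data[3], unsigned int *major, unsigned int *minor)
{
	if (i_data[0]) {
		*major = (i_data[0] >> 8) & 0xffu;
		*minor = i_data[0] & 0xffu;
	} else {
		*major = (i_data[1] & 0xfff00u) >> 8;
		*minor = (i_data[1] & 0xffu) | ((i_data[1] >> 12) & 0xfff00u);
	}
}
```

      Zeroing the third word in the large case is the point of this fix: a
      later, wider encoding can then rely on that storage being defined.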
  4. 23 Sep, 2003 1 commit
    • [PATCH] 32-bit dev_t: switch-over · 1c2c2a8f
      Alexander Viro authored
      Real conversion to 32bit dev_t.  Expansion to:
      	* mknod() - 32
      	* newstat() - 32 on 64bit platforms
      	* stat64() - 32 on mips, 64 on everything else (mips has weird struct
      stat64 and can't get more than 32 bits).  Note that right now the difference
      is purely theoretical - we don't have internal values above 32 bits, so
      huge_... vs. new_... only marks the places where 64bit conversion will need
      extra work.
      	* arch-dependent stat variants - depending on width available.
      	* ustat et al. - 32
      	* filesystems that can handle 32 bits right now - 32
      	* ext2 and ext3 - 32, with large dev_t inodes having 0 in the first
      element of i_data[] (where we store dev_t value for small device numbers) and
      keeping the value in the second element.
      	* nfsd - 32; it can be driven to 64, but we'll get several issues with
      NFSv2 support.
      	* RAID - 32
      	* devmapper - with v1 it's still 16 (nothing to do here), with v4 it's
      64.
      	* loop - 64
      	* initramfs - 32
      	* do_mounts code - 32.  Parts that scan devfs tree are using newstat()
      on 64bit platforms and stat64() on the rest (IOW, the latest stat variant on
      given platform).
      	* old_valid_dev()/new_valid_dev() added where needed (stat variants,
      mostly - we fail with -EOVERFLOW if values do not fit).
  5. 05 Sep, 2003 2 commits
  6. 01 Aug, 2003 2 commits
    • [PATCH] don't init statics to 0 (fs/) · 9cf89014
      Randy Dunlap authored
      From: Leann Ogasawara <ogasawara@osdl.org>
      
      Remove the explicit zero initializers from static variables so that they
      are placed in .bss instead of .data.
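
      A small illustration of the pattern, assuming a compiler of the time
      that emitted explicitly zero-initialized statics into .data.  The names
      open_count and bump_open_count() are made up for the example.

```c
/* Before: the explicit initializer was emitted into .data. */
/* static int open_count = 0; */

/*
 * After: no initializer.  C guarantees that objects with static storage
 * start out as zero, so the value is the same, but the object can live
 * in .bss and take no space in the image.
 */
static int open_count;

int bump_open_count(void)
{
	return ++open_count;
}
```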
    • [PATCH] direct-io support for XFS unwritten extents · 359a5de1
      Andrew Morton authored
      From: Nathan Scott <nathans@sgi.com>
      
      This patch adds a mechanism by which a filesystem can register an interest in
      the completion of direct I/O.  The completion routine will be given the
      inode, an offset and a length, and an optional filesystem-private field.
      
      We have extended the use of the buffer_head-based interface (i.e.
      get_block_t) for direct I/O such that the b_private field is now utilised.
      It is defined to be initially zero at the start of I/O, and will be passed
      into the filesystem unmodified by the VFS with each map request, while
      setting up the direct I/O.  Once I/O has completed the final value of this
      pointer will be passed into a filesystem's I/O completion handler.  This
      mechanism can be used to keep track of all of the mapping requests which
      encompass an individual direct I/O request.
      
      This has been implemented specifically for XFS, but is done so as to be as
      generic as possible.  XFS uses this mechanism to provide support for
      unwritten extents - these are file extents which have been pre-allocated
      on-disk, but not yet written to (once written, these become regular file
      extents, but only once I/O is complete).
  7. 25 Jul, 2003 1 commit
  8. 03 Apr, 2003 1 commit
    • [PATCH] handle bad inodes in put_inode · 68fa8120
      Andrew Morton authored
      From: "J. Bruce Fields" <bfields@fieldses.org>
      
      If the NFS daemon is presented with a filehandle for a file that has
      been deleted, it does an iget() in fs/exportfs/expfs.c:export_iget() and
      gets a bad inode back.  When it subsequently iput()s the inode, the
      result is:
      
      Mar 27 12:53:40 snoopy kernel: EXT2-fs error (device ide0(3,3)): ext2_free_blocks: Freeing blocks not in datazone - block = 1802201963, count = 27499
      Mar 27 12:53:40 snoopy kernel: Remounting filesystem read-only
      
      The same can happen if ext2_get_inode() returns an error - ext2_read_inode()
      will return an uninitialised inode and ext2_put_inode() is not allowed to go
      looking inside the bad inode.
  9. 16 Mar, 2003 1 commit
    • [PATCH] Ext2/3 noatime and dirsync fixes · 3bdfab20
      Andrew Morton authored
      Patch from "Theodore Ts'o" <tytso@mit.edu>
      
      I recently noticed a bug in ext2/3; newly created inodes which inherit
      the noatime flag from their containing directory do not respect noatime
      until the inode is flushed from the inode cache and then re-read later.
      This is because the code which checks the ext2 no-atime attribute and
      then sets the S_NOATIME in inode->i_flags is present in
      ext2_read_inode(), but not in ext2_new_inode().
      
      I fixed this in 2.4, and then found an even worse bug in the 2.5 code;
      the DIRSYNC flag is completely ignored *except* in the case where a
      directory is newly created using mkdir and its parent directory has the
      DIRSYNC flag.  S_DIRSYNC doesn't get set in the ext2_new_inode() or the
      ext2_ioctl() paths (which is used by chattr).
      
      This patch centralizes the code which translates the ext2 flags in the
      raw ext2 inode to the appropriate flag values in inode->i_flags in a
      single location.  This fixes the bug, makes things cleaner, and also
      removes 30 lines of code and 128 bytes of compiled x86 text in the
      bargain.
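
      The centralised translation could look roughly like the sketch below.
      It is a userspace model, not the patch itself: toy_set_inode_flags() is
      a hypothetical name, the EXT2_*_FL values are reproduced from memory of
      the ext2 headers, and the S_* values are illustrative only.

```c
/* On-disk ext2 flag bits (assumed to match the ext2 headers). */
#define EXT2_SYNC_FL      0x00000008u
#define EXT2_IMMUTABLE_FL 0x00000010u
#define EXT2_APPEND_FL    0x00000020u
#define EXT2_NOATIME_FL   0x00000080u
#define EXT2_DIRSYNC_FL   0x00010000u

/* In-memory inode flag bits; numeric values are illustrative only. */
#define S_SYNC      0x01u
#define S_NOATIME   0x02u
#define S_APPEND    0x04u
#define S_IMMUTABLE 0x08u
#define S_DIRSYNC   0x40u

/*
 * Translate the raw on-disk flags into in-memory flags in one place, so
 * that the read-inode, new-inode and ioctl paths all get the same result.
 * Bits of i_flags that are not derived from the raw flags are preserved.
 */
unsigned int toy_set_inode_flags(unsigned int i_flags, unsigned int raw_flags)
{
	i_flags &= ~(S_SYNC | S_NOATIME | S_APPEND | S_IMMUTABLE | S_DIRSYNC);
	if (raw_flags & EXT2_SYNC_FL)
		i_flags |= S_SYNC;
	if (raw_flags & EXT2_APPEND_FL)
		i_flags |= S_APPEND;
	if (raw_flags & EXT2_IMMUTABLE_FL)
		i_flags |= S_IMMUTABLE;
	if (raw_flags & EXT2_NOATIME_FL)
		i_flags |= S_NOATIME;
	if (raw_flags & EXT2_DIRSYNC_FL)
		i_flags |= S_DIRSYNC;
	return i_flags;
}
```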
  10. 10 Feb, 2003 1 commit
    • [PATCH] Fix synchronous writers to wait properly for the result · 8d49bf3f
      Andrew Morton authored
      Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
      ll_rw_block() usage.
      
      Typical usage is:
      
      	mark_buffer_dirty(bh);
      	ll_rw_block(WRITE, 1, &bh);
      	wait_on_buffer(bh);
      
      The problem is that if the buffer was locked on entry to this code sequence
      (due to in-progress I/O), ll_rw_block() will neither wait nor start new
      I/O.  So this code will wait on the _old_ I/O, and will then continue
      execution, leaving the buffer dirty.
      
      It turns out that all callers were only writing one buffer, and they were all
      waiting on that writeout.  So I added a new sync_dirty_buffer() function:
      
      	void sync_dirty_buffer(struct buffer_head *bh)
      	{
      		lock_buffer(bh);
      		if (test_clear_buffer_dirty(bh)) {
      			get_bh(bh);
      			bh->b_end_io = end_buffer_io_sync;
      			submit_bh(WRITE, bh);
      		} else {
      			unlock_buffer(bh);
      		}
      	}
      
      which allowed a fair amount of code to be removed, while adding the desired
      data-integrity guarantees.
      
      UFS has its own wrappers around ll_rw_block() which got in the way, so this
      operation was open-coded in that case.
  11. 02 Feb, 2003 1 commit
  12. 08 Jan, 2003 1 commit
    • [PATCH] AIO support for raw/O_DIRECT · 08e6749e
      Andrew Morton authored
      Patch from Badari Pulavarty <pbadari@us.ibm.com> and myself
      
      This patch adds the infrastructure for performing asynchronous (AIO) blockdev
      direct-IO.
      
      - Adds generic_file_aio_write_nolock() and makes the other
        generic_file_*_write() functions use it.
      
      - Modify generic_file_direct_IO() and ->direct_IO() functions to take
        "kiocb *" instead of "file *".
      
      - Renames generic_direct_IO() to blockdev_direct_IO().
      
      - Move generic_file_direct_IO() to mm/filemap.c (it is not
        blockdev-specific, whereas the rest of fs/direct-io.c is).
      
      - Add AIO read/write support to the raw driver.
  13. 21 Dec, 2002 1 commit
  14. 14 Dec, 2002 2 commits
    • [PATCH] ext2 synchronous mount fix · 7cc9ee3d
      Andrew Morton authored
      The optimisation for synchronous mounts was only correct for S_ISREG
      files.  Directories do not pass through generic_osync_inode() and we
      still need to synchronously write out their indirect blocks.
    • [PATCH] remove PF_SYNC · 577c516f
      Andrew Morton authored
      current->flags:PF_SYNC was a hack I added because I didn't want to
      change all ->writepage implementations.
      
      It's foul.  And it means that if someone happens to run direct page
      reclaim within the context of (say) sys_sync, the writepage invocations
      from the VM will be treated as "data integrity" operations, not "memory
      cleansing" operations, which would cause latency.
      
      So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
      It is the `writeback_control' structure which contains the full context
      information about why writepage was called.
      
      The initial version of this patch just passed in a bare `int sync', but
      the XFS team need more info so they can perform writearound from within
      page reclaim.
      
      The patch also adds writeback_control.for_reclaim, so writepage
      implementations can inspect that to work out the call context rather
      than peeking at current->flags:PF_MEMALLOC.
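
      A toy model of passing the call context explicitly.  The field names
      sync_mode and for_reclaim follow the text; the toy_ types, the enum
      values and the "waited" bookkeeping are hypothetical and only stand in
      for real writepage behaviour.

```c
#include <assert.h>

enum toy_sync_mode { TOY_WB_SYNC_NONE, TOY_WB_SYNC_ALL };

/* Toy version of the context structure handed to writepage. */
struct toy_writeback_control {
	enum toy_sync_mode sync_mode; /* data integrity vs. memory cleansing */
	int for_reclaim;              /* non-zero when called from page reclaim */
};

struct toy_page {
	int dirty;
	int waited; /* records whether the caller's context asked for integrity */
};

/*
 * The decision comes from the explicit argument, not from a per-task flag,
 * so page reclaim running inside a sync caller is not mistaken for a
 * data-integrity write.
 */
int toy_writepage(struct toy_page *page, const struct toy_writeback_control *wbc)
{
	assert(page && wbc);
	page->dirty = 0; /* pretend the I/O was started */
	page->waited = (wbc->sync_mode == TOY_WB_SYNC_ALL && !wbc->for_reclaim);
	return 0;
}
```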
  15. 22 Nov, 2002 2 commits
    • [PATCH] no-buffer-head ext2 option · b1ad1f4e
      Andrew Morton authored
      Implements a new set of block address_space_operations which will never
      attach buffer_heads to file pagecache.  These can be turned on for ext2
      with the `nobh' mount option.
      
      During write-intensive testing on a 7G machine, total buffer_head
      storage remained below 0.3 megabytes.  And those buffer_heads are
      against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL
      memory pressure.
      
      This work is, of course, a special for the huge highmem machines.
      Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't
      work terribly well), but that code is simple, and will provide relief
      for other filesystems.
      
      
      It should be noted that the nobh_prepare_write() function and the
      PageMappedToDisk() infrastructure is what is needed to solve the
      problem of user data corruption when the filesystem which backs a
      sparse MAP_SHARED mapping runs out of space.  We can use this code in
      filemap_nopage() to ensure that all mapped pages have space allocated
      on-disk.  Deliver SIGBUS on ENOSPC.
      
      This will require a new address_space op, I expect.
    • [PATCH] Remove mapping->vm_writeback · 53bf7bef
      Andrew Morton authored
      The vm_writeback address_space operation was designed to provide the VM
      with a "clustered writeout" capability.  It allowed the filesystem to
      perform more intelligent writearound decisions when the VM was trying
      to clean a particular page.
      
      I can't say I ever saw any real benefit from this - not much writeout
      actually happens on that path - quite a lot of work has gone into
      minimising it actually.
      
      The default ->vm_writeback a_op which I provided wrote back the pages
      in ->dirty_pages order.  But there is one scenario in which this causes
      problems - writing a single 4G file with mem=4G.  We end up with all of
      ZONE_NORMAL full of dirty pages, but all writeback effort is against
      highmem pages.  (Because there is about 1.5G of dirty memory total).
      
      Net effect: the machine stalls ZONE_NORMAL allocation attempts until
      the ->dirty_pages writeback advances onto ZONE_NORMAL pages.
      
      This can be fixed most sweetly with additional radix-tree
      infrastructure which will be quite complex.  Later.
      
      
      So this patch dumps it all, and goes back to using writepage
      against individual pages as they come off the LRU.
  16. 17 Nov, 2002 1 commit
    • [PATCH] nanosecond stat timefields · 5d62665d
      Andi Kleen authored
      stat64 has been changed to return jiffies granularity as nsec in previously
      unused fields.  This allows make to make better decisions on when
      to recompile a file.  Loosely follows the Solaris API.
      
      CURRENT_TIME has been redefined to return struct timespec.  The users
      who don't use it in an inode/attr context have been changed to use a new
      get_seconds() function.  CURRENT_TIME is implemented by an out-of-line
      function.
      
      There is a small performance penalty in this patch.  The previous
      filemap code had an optimization to flush atime only once a second.
      This is currently gone, which will increase flushes a bit.  I believe
      the correct solution if it should be a problem is to have per super
      block fields that give an arbitrary atime flush granularity - so that you
      can set it to be only flushed once an hour if you prefer that.  I will
      work on that later in separate patches if the need should arise.
      
      struct inode and the attr struct have been changed to store struct
      timespec instead of time_t for [cma]time.  Not all file systems support
      this granularity, but some like XFS, NFSv3, CIFS and JFS do.  The others
      will currently truncate the nsec part on flushing to disk.  There was some
      discussion on this rounding on l-k previously.  I went for simple
      truncation because there is not much evidence IMHO that the more
      complicated roundings have any advantages.  In practice applications will
      be rather unlikely to notice the rounding anyways - they can only see a
      difference when an inode is flushed from memory and reloaded in less than
      a second, which is rather unlikely.
  17. 05 Nov, 2002 2 commits
    • [PATCH] Make ->readpages palatable to NFS · b729e488
      Trond Myklebust authored
      The following patch makes the ->readpages() address_space_operation
      take a struct file argument just like ->readpage().
    • [PATCH] `event' removal: ext2 · 9aefc010
      Andrew Morton authored
      Patch from Manfred Spraul
      
      Use a local counter instead of the global 'event' variable for the
      readdir() optimization.
      
      Depends on patch-event-II
      
      Background:
        The only user of i_version and f_version in ext2 is
        ext2_readdir(). As an optimization, ext2 performs the
        validation of the start position for readdir() only if
              filp->f_version != inode->i_version.
        If there was no llseek and no directory change since the
        last readdir() call, then f_pos can be trusted.
        f_version is set to 0 in get_empty_filp and during llseek.
        Right now, i_version is set to ++event during ext2_read_inode
        and commit_chunk, i.e. at inode creation and if a directory
        is changed.
        Initializing i_version to 1, and updating with i_version++
        achieves the same effect, without the need of a global variable.
        Global uniqueness is not required, there are no other uses
        of [if]_version in ext2.
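
      The version check can be modelled in a few lines of userspace C.  This
      is a sketch under stated assumptions: the toy_ types and functions are
      hypothetical, and "validations" merely counts how often the start
      position would have been re-checked.

```c
#include <assert.h>

struct toy_dir {
	unsigned long i_version; /* bumped on every directory change */
};

struct toy_file {
	unsigned long f_version; /* version seen by the last readdir */
	long f_pos;
	int validations;         /* how often f_pos had to be re-validated */
};

void toy_dir_init(struct toy_dir *dir)   { dir->i_version = 1; }
void toy_dir_change(struct toy_dir *dir) { dir->i_version++; }

void toy_open(struct toy_file *f)
{
	f->f_version = 0; /* never equals a directory's version, which starts at 1 */
	f->f_pos = 0;
	f->validations = 0;
}

void toy_llseek(struct toy_file *f, long pos)
{
	f->f_pos = pos;
	f->f_version = 0; /* force validation on the next readdir */
}

void toy_readdir(struct toy_file *f, const struct toy_dir *dir)
{
	assert(f && dir);
	if (f->f_version != dir->i_version) {
		f->validations++;              /* stand-in for re-checking f_pos */
		f->f_version = dir->i_version;
	}
	/* ... entries would be returned starting at f_pos ... */
}
```

      Because the counter is local to the inode, no global `event' variable
      is needed; only inequality between the two versions matters.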
      
      Change relative to the patch you have right now:
      i_version is initialized to 1 instead of 0. For ext2 it doesn't
      matter [there is always a valid 'len' value at the beginning of a
      directory data block], but it's cleaner.
  18. 30 Oct, 2002 3 commits
  19. 29 Oct, 2002 1 commit
    • [PATCH] permit direct IO with finer-than-fs-blocksize alignments · 4a4c6811
      Andrew Morton authored
      Mainly from Badari Pulavarty
      
      Traditionally we have only supported O_DIRECT I/O at an alignment and
      granularity which matches the underlying filesystem.  That typically
      means that all IO must be 4k-aligned and a multiple of 4k in size.
      
      Here, we relax that so that direct I/O happens with (typically)
      512-byte alignment and multiple-of-512-byte size.
      
      The tricky part is when a write starts and/or ends partway through a
      filesystem block which has just been added.  We need to zero out the
      parts of that block which lie outside the written region.
      
      We handle that by putting appropriately-sized parts of the ZERO_PAGE
      into separate BIOs.
      
      The generic_direct_IO() function has been changed so that the
      filesystem must pass in the address of the block_device against which
      the IO is to be performed.  I'd have preferred to not do this, but we
      do need that info at that time so that alignment checks can be
      performed.
      
      If the filesystem passes in a NULL block_device pointer then we fall
      back to the old behaviour - must align with the fs blocksize.
      
      There is no trivial way for userspace to know what the minimum
      alignment is - it depends on what bdev_hardsect_size() says about the
      device.  It is _usually_ 512 bytes, but not always.  This introduces
      the risk that someone will develop and test applications which work
      fine on their hardware, but will fail on someone else's hardware.
      
      It is possible to query the hardsect size using the BLKSSZGET ioctl
      against the backing block device.  This can be performed at runtime or
      at application installation time.
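
      From userspace, the query and the resulting alignment check might look
      like the sketch below.  It is Linux-specific; query_sector_size() and
      dio_request_aligned() are hypothetical helper names, and checking the
      buffer address as well as the offset and length is an assumption about
      what an application would want to verify.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h> /* BLKSSZGET */

/*
 * Return the sector size in bytes reported by the block device open on fd,
 * or -1 if the ioctl fails (for example, fd is not a block device).
 */
int query_sector_size(int fd)
{
	int size = 0;

	if (ioctl(fd, BLKSSZGET, &size) < 0)
		return -1;
	return size;
}

/*
 * Return 1 if the buffer address, file offset and length are all
 * multiples of sector_size, 0 otherwise.
 */
int dio_request_aligned(const void *buf, long long offset,
			unsigned long len, unsigned long sector_size)
{
	if (sector_size == 0 || offset < 0)
		return 0;
	return ((uintptr_t)buf % sector_size) == 0 &&
	       ((unsigned long long)offset % sector_size) == 0 &&
	       (len % sector_size) == 0;
}
```

      An application would call query_sector_size() once against the backing
      block device and then size and align its O_DIRECT buffers accordingly,
      rather than hard-coding 512.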
  20. 12 Oct, 2002 1 commit
    • Fix warnings of the form · 2a022093
      Richard Henderson authored
        warning: long int format, different type arg (arg 5)
      by casting ino_t arguments to unsigned long for printf formats.
      In some instances, change %ld to %lu.
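
      The pattern, shown as a small userspace example (format_inode() is a
      made-up helper, not one of the functions touched by the patch):

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * The width of ino_t differs between platforms, so cast it explicitly to
 * the type the format string names.  %lu rather than %ld, because inode
 * numbers are unsigned.
 */
int format_inode(char *buf, size_t len, ino_t ino)
{
	return snprintf(buf, len, "inode %lu", (unsigned long)ino);
}
```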
  21. 09 Oct, 2002 1 commit
    • [PATCH] 64-bit sector_t - filesystems · 763fb9a3
      Andrew Morton authored
      From Peter Chubb
      
      Filesystem migration to possibly 64-bit sector_t:
       - bmap() now takes and returns a sector_t to allow filesystems
         (e.g., JFS, XFS) that are 64-bit clean to deal with large files
       - buffer handling now 64-bit clean
      
      Enable 64-bit sector_t on IA32 and PPC.
      
      kiobufs takes sector_t array, not array of long.
      Fix blkmtd.c to deal in such an array.
      
      Miscellaneous fixes for 64-bit sector_t.
       	 - missed printk formats
      	 - ide_floppy_do_request had incorrect signature
      	 - in blkmtd.c there was a pointer used to
      	   manipulate an array to be used by kiobuf --
       	   it was unsigned long, needed to be sector_t
  22. 07 Oct, 2002 1 commit
    • [PATCH] add struct file* to ->direct_IO addr space op · 3a453bd4
      Chuck Lever authored
      This makes file credentials available to the ->direct_IO address space
      operation by replacing its struct inode* argument with a struct file*
      argument.  This patch is a prerequisite for NFS direct I/O support.  It
      breaks the raw device driver.
  23. 05 Oct, 2002 1 commit
    • [PATCH] remove write_mapping_buffers() · 4ac833da
      Andrew Morton authored
      When the global buffer LRU was present, dirty ext2 indirect blocks were
      automatically scheduled for writeback alongside their data.
      
      I added write_mapping_buffers() to replace this - the idea was to
      schedule the indirects close in time to the scheduling of their data.
      
      It works OK for small-to-medium sized files but for large, linear writes
      it doesn't work: the request queue is completely full of file data and
      when we later come to scheduling the indirects, their neighbouring data
      has already been written.
      
      So writeback of really huge files tends to be a bit seeky.
      
      So.  Kill it.  Will fix this problem by other means.
  24. 19 Sep, 2002 1 commit
    • [PATCH] clean up argument passing in writeback paths · 967e6864
      Andrew Morton authored
      The writeback code paths which walk the superblocks and inodes are
      getting an increasing number of arguments passed to them.
      
      The patch wraps those args into the new `struct writeback_control',
      and uses that instead.  There is no functional change.
      
      The new writeback_control structure is passed down through the
      writeback paths in the place where the old `nr_to_write' pointer used
      to be.
      
      writeback_control will be used to pass new information up and down the
      writeback paths.  Such as whether the writeback should be non-blocking,
      and whether queue congestion was encountered.
  25. 13 Sep, 2002 1 commit
    • [PATCH] readv/writev speedup · a83638a4
      Andrew Morton authored
      This is Janet Morgan's patch which converts the readv/writev code
      to submit all segments for IO before waiting on them, rather than
      submitting each segment separately.
      
      This is a critical performance fix for O_DIRECT reads and writes.
      Prior to this change, O_DIRECT vectored IO was forced to wait for
      completion against each segment of the iovec rather than submitting all
      segments and waiting on the lot.  ie: for ten segments, this code will
      be ten times faster.
      
      There will also be moderate improvements for buffered IO - smaller code
      paths, plus writev() only takes i_sem once.
      
      The patch ended up quite large unfortunately - turned out that the only
      sane way to implement this without duplicating significant amounts of
      code (the generic_file_write() bounds checking, all the O_DIRECT
      handling, etc) was to redo generic_file_read() and generic_file_write()
      to take an iovec/nr_segs pair rather than `buf, count'.
      
      New exported functions generic_file_readv() and generic_file_writev()
      have been added:
      
      ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
                                unsigned long nr_segs, loff_t *ppos);
      ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
                                unsigned long nr_segs, loff_t * ppos);
      
      If a driver does not use these in their file_operations then they will
      continue to use the old readv/writev code, which sits in a loop calling
      fops->read() or fops->write().
      
      ext2, ext3, JFS and the blockdev driver are currently using this
      capability.
      
      Some coding cleanups were made in fs/read_write.c.  Mainly:
      
      - pass "READ" or "WRITE" around to indicate the direction of the
        operation, rather than the (confusing, inverted)
        VERIFY_READ/VERIFY_WRITE.
      
      - Use the identifier `nr_segs' everywhere to indicate the iovec
        length rather than `count', which is often used to indicate the
        number of bytes in the syscall.  It was confusing the heck out of me.
      
      - Some cleanups to the raw driver.
      
      - Some additional generality in fs/direct_io.c: the core `struct dio'
        used to be a "populate-and-go" thing.  Janet has broken that up so
        you can initialise a struct dio once, then loop around feeding it
        more file segments, then wait on completion against everything.
      
      - In a couple of places we needed to handle the situation where we
        knew, a-priori, that the user was going to get a short read or write.
        File size limit exceeded, read past i_size, etc.  We handled that by
        shortening the iovec in-place with iov_shorten().  Which is not
        particularly pretty, but neither were the alternatives.
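
      For illustration, shortening an iovec in place can be sketched in
      userspace like this.  It is a sketch of the technique, not a copy of
      the kernel's iov_shorten(); the name iov_shorten_sketch() is
      hypothetical.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Trim an iovec in place so that it describes at most `to` bytes.
 * Returns the number of segments still in use.  If the iovec already
 * describes fewer than `to` bytes it is left untouched.
 */
unsigned long iov_shorten_sketch(struct iovec *iov, unsigned long nr_segs,
				 size_t to)
{
	unsigned long seg = 0;
	size_t len = 0;

	while (seg < nr_segs) {
		seg++;
		if (len + iov->iov_len >= to) {
			iov->iov_len = to - len; /* cut the last segment short */
			break;
		}
		len += iov->iov_len;
		iov++;
	}
	return seg;
}
```

      The caller then passes the returned segment count down instead of the
      original nr_segs, so the short read or write falls out naturally.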
  26. 13 Aug, 2002 1 commit
  27. 28 Jul, 2002 1 commit
    • [PATCH] direct IO updates · 0d85f8bf
      Andrew Morton authored
      This patch is a performance and correctness update to the direct-IO
      code: O_DIRECT and the raw driver.  It mainly affects IO against
      blockdevs.
      
      The direct_io code was returning -EINVAL for a filesystem hole.  Change
      it to clear the userspace page instead.
      
      There were a few restrictions and weirdnesses wrt blocksize and
      alignments.  The code has been reworked so we now lay out maximum-sized
      BIOs at any sector alignment.
      
      Because of this, the raw driver has been altered to set the blockdev's
      soft blocksize to the minimum possible at open() time.  Typically, 512
      bytes.  There are now no performance disadvantages to using small
      blocksizes, and this gives the finest possible alignment.
      
      There is no API here for setting or querying the soft blocksize of the
      raw driver (there never was, really), which could conceivably be a
      problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
      fd which /dev/raw/rawN returned, but that would require that
      blk_ioctl() be exported to modules again.
      
      This code is wickedly quick.  Here's an oprofile of a single 500MHz
      PIII reading from four (old) scsi disks (two aic7xxx controllers) via
      the raw driver.  Aggregate throughput is 72 megabytes/second:
      
      c013363c 24       0.0896492   __set_page_dirty_buffers
      c021b8cc 24       0.0896492   ahc_linux_isr
      c012b5dc 25       0.0933846   kmem_cache_free
      c014d894 26       0.09712     dio_bio_complete
      c01cc78c 26       0.09712     number
      c0123bd4 40       0.149415    follow_page
      c01eed8c 46       0.171828    end_that_request_first
      c01ed410 49       0.183034    blk_recount_segments
      c01ed574 65       0.2428      blk_rq_map_sg
      c014db38 85       0.317508    do_direct_IO
      c021b090 90       0.336185    ahc_linux_run_device_queue
      c010bb78 236      0.881551    timer_interrupt
      c01052d8 25354    94.707      poll_idle
      
      A testament to the efficiency of the 2.5 block layer.
      
      And against four IDE disks on an HPT374 controller.  Throughput is 120
      megabytes/sec:
      
      c01eed8c 80       0.292462    end_that_request_first
      c01fe850 87       0.318052    hpt3xx_intrproc
      c01ed574 123      0.44966     blk_rq_map_sg
      c01f8f10 141      0.515464    ata_select
      c014db38 153      0.559333    do_direct_IO
      c010bb78 235      0.859107    timer_interrupt
      c01f9144 281      1.02727     ata_irq_enable
      c01ff990 290      1.06017     udma_pci_init
      c01fe878 308      1.12598     hpt3xx_maskproc
      c02006f8 379      1.38554     idedisk_do_request
      c02356a0 609      2.22637     pci_conf1_read
      c01ff8dc 611      2.23368     udma_pci_start
      c01ff950 922      3.37062     udma_pci_irq_status
      c01f8fac 1002     3.66308     ata_status
      c01ff26c 1059     3.87146     ata_start_dma
      c01feb70 1141     4.17124     hpt374_udma_stop
      c01f9228 3072     11.2305     ata_out_regfile
      c01052d8 15193    55.5422     poll_idle
      
      Not so good.
      
      One problem which has been identified with O_DIRECT is the cost of
      repeated calls into the mapping's get_block() callback.  Not a big
      problem with ext2 but other filesystems have more complex get_block
      implementations.
      
      So what I have done is to require that callers of generic_direct_IO()
      implement the new `get_blocks()' interface.  This is a small extension
      to get_block().  It gets passed another argument which indicates the
      maximum number of blocks which should be mapped, and it returns the
      number of blocks which it did map in bh_result->b_size.  This allows
      the fs to map up to 4G of disk (or of hole) in a single get_block()
      invocation.
      
      There are some other caveats and requirements of get_blocks() which are
      documented in the comment block over fs/direct_io.c:get_more_blocks().
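
      A toy model of the interface shape, for a file stored as one contiguous
      extent.  Everything here is hypothetical except the idea itself: the
      caller passes a maximum block count and the mapped size comes back
      through b_size (expressed in bytes in this sketch, which is an
      assumption).

```c
#include <assert.h>

/* Toy mapping result; the field names echo buffer_head. */
struct toy_bh {
	unsigned long long b_blocknr; /* first disk block of the mapped run */
	unsigned long b_size;         /* bytes mapped by this call */
};

/* A toy file laid out as a single contiguous extent. */
struct toy_extent {
	unsigned long long file_start; /* first file block covered */
	unsigned long long disk_start; /* disk block backing file_start */
	unsigned long nr_blocks;       /* length of the extent in blocks */
	unsigned int blocksize;        /* bytes per block */
};

/*
 * Map up to max_blocks blocks starting at file block iblock in one call.
 * Returns 0 on success, -1 if iblock lies outside the extent (real code
 * would report a hole instead).
 */
int toy_get_blocks(const struct toy_extent *ext, unsigned long long iblock,
		   unsigned long max_blocks, struct toy_bh *bh_result)
{
	unsigned long long end = ext->file_start + ext->nr_blocks;
	unsigned long long avail;

	assert(max_blocks > 0);
	if (iblock < ext->file_start || iblock >= end)
		return -1;
	avail = end - iblock;
	if (avail > max_blocks)
		avail = max_blocks;
	bh_result->b_blocknr = ext->disk_start + (iblock - ext->file_start);
	bh_result->b_size = (unsigned long)avail * ext->blocksize;
	return 0;
}
```

      The saving is that a linear run is mapped with one call instead of one
      call per block.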
      
      Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
      mapping.  It certainly allows good speedups.  But it doesn't allow the
      fs to return a scatter list of blocks - it only understands linear
      chunks of disk.  I think that's really all it _should_ do.
      
      I'll let get_blocks() sit for a while and wait for some feedback.  If
      it is sufficient and nobody objects too much, I shall convert all
      get_block() instances in the kernel to be get_blocks() instances.  And
      I'll teach readahead (at least) to use the get_blocks() extension.
      
      Delayed allocate writeback could use get_blocks().  As could
      block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
      mileage using it in mpage_writepages() because all our filesystems are
      syncalloc, and nobody uses MAP_SHARED for much.
      
      It will be tricky to use get_blocks() for writes, because if a ton of
      blocks have been mapped into the file and then something goes wrong,
      the kernel needs to either remove those blocks from the file or zero
      them out.  The direct_io code zeroes them out.
      
      btw, some time ago you mentioned that some drivers and/or hardware may
      get upset if there are multiple simultaneous IOs in progress against
      the same block.  Well, the raw driver has always allowed that to
      happen.  O_DIRECT writes to blockdevs do as well now.
      
      todo:
      
      1) The driver will probably explode if someone runs BLKBSZSET while
         IO is in progress.  Need to use bdclaim() somewhere.
      
      2) readv() and writev() need to become direct_io-aware.  At present
         we're doing stop-and-wait for each segment when performing
         readv/writev against the raw driver and O_DIRECT blockdevs.
      0d85f8bf
  28. 15 Jul, 2002 1 commit
    • Andreas Dilger's avatar
      [PATCH] 2.5 i_size_high fixup · 884e7cce
      Andreas Dilger authored
      This patch is a minor fixup to ext2/inode.c to avoid displaying the
      high 32 bits of the size for anything other than regular files.  For
      sockets, pipes, symlinks, etc. it doesn't make sense to have a value
      larger than 2GB, and this has already been fixed in ext3 and e2fsprogs.
      884e7cce
  29. 14 Jul, 2002 1 commit
    • Andrew Morton's avatar
      [PATCH] direct-to-BIO for O_DIRECT · 42ec8bc1
      Andrew Morton authored
      Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing
      the kiovec layer.  It's followed by a patch which converts the raw
      driver to use the O_DIRECT engine.
      
      CPU utilisation is about the same as the kiovec-based implementation.
      Read and write bandwidth are the same too, for 128k chunks.   But with
      one megabyte chunks, this implementation is 20% faster at writing.
      
      I assume this is because the kiobuf-based implementation has to stop
      and wait for each 128k chunk, whereas this code streams the entire
      request, regardless of its size.
      
      This is with a single (oldish) scsi disk on aic7xxx.  I'd expect the
      margin to widen on higher-end hardware which likes to have more
      requests in flight.
      
      Question is: what do we want to do with this sucker?  These are the
      remaining users of kiovecs:
      
      	drivers/md/lvm-snap.c
      	drivers/media/video/video-buf.c
      	drivers/mtd/devices/blkmtd.c
      	drivers/scsi/sg.c
      
      The video and mtd drivers seem to be fairly easy to de-kiobufize.
      I'm aware of one proprietary driver which uses kiobufs.  XFS uses
      kiobufs a little bit - just to map the pages.
      
      So with a bit of effort and maintainer-irritation, we can extract
      the kiobuf layer from the kernel.
      42ec8bc1
  30. 12 Jun, 2002 1 commit
  31. 27 May, 2002 3 commits
    • Andrew Morton's avatar
      [PATCH] dirsync · bb772c58
      Andrew Morton authored
      An implementation of directory-synchronous mounts.
      
      I sent this out some months ago and it didn't generate a lot of
      interest.  Later we had one of the usual cheery exchanges with Wietse
      Venema (postfix development) and he agreed that directory synchronous
      mounts were something that he could use, and that there was benefit in
      implementing them in Linux.  If you choose to apply this I'll push the
      2.4 patch.
      
      
      
      Patch against e2fsprogs-1.26:
              http://www.zip.com.au/~akpm/linux/dirsync/e2fsprogs-1.26.patch
      
      Patch against util-linux-2.11n:
              http://www.zip.com.au/~akpm/linux/dirsync/util-linux-2.11n.patch
      
      
      The kernel patch includes implementations for ext2 and ext3. It's
      pretty simple.
      
      - When dirsync is in operation against a directory, the following operations
        are synchronous within that directory:  create, link, unlink, symlink,
        mkdir, rmdir, mknod, rename (synchronous if either the source or dest
        directory is dirsync).
      
      - dirsync is a subset of sync.  So `mount -o sync' or `chattr +S'
        give you everything which `mount -o dirsync' or `chattr +D' gives,
        plus synchronous file writes.
      
      - ext2's inode.i_attr_flags is unused, and is removed.
      
      - mount /dev/foo /mnt/bar -o dirsync  works as expected.
      
      - An ext2 or ext3 directory tree can be set dirsync with `chattr +D -R'.
      
      - dirsync is maintained as new directories are created under
        a `chattr +D' directory.  Like `chattr +S'.
      
      - Other filesystems can trivially be taught about dirsync.  It's just
        a matter of replacing `IS_SYNC(inode)' with `IS_DIRSYNC(inode)' in
        the directory update functions.  IS_SYNC will still be honoured when
        IS_DIRSYNC is used.
      
      - Non-directory files do not have their dirsync flag propagated.  So
        an S_ISREG file which is created inside a dirsync directory will not
        have its dirsync bit set.  chattr needs to do this as well.
      
      - There was a bit of version skew between e2fsprogs' idea of the
        inode flags and the kernel's.  That is sorted out here.
      
      - `lsattr' shows the dirsync flag as "D".  The letter "D" was
        previously being used for Compressed_Dirty_File.  I changed
        Compressed_Dirty_File to use "Z".  Is that OK?
      
      The mount(2) manpage needs to be taught about MS_DIRSYNC.
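
      The relationship between sync and dirsync described in the list
      above can be sketched as below.  This is a userspace model, not the
      kernel's macros: the `demo_' names and the flag values are made up
      for illustration.  It only shows that a dirsync test is also
      satisfied by sync, and that rename checks both directories.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define DEMO_SYNC    0x1u
#define DEMO_DIRSYNC 0x2u

struct demo_inode {
    unsigned int flags;
};

/* All writes to this inode are synchronous. */
static bool demo_is_sync(const struct demo_inode *inode)
{
    return (inode->flags & DEMO_SYNC) != 0;
}

/*
 * Directory updates are synchronous.  dirsync is a subset of sync, so
 * the sync flag satisfies this test as well.
 */
static bool demo_is_dirsync(const struct demo_inode *inode)
{
    return (inode->flags & (DEMO_SYNC | DEMO_DIRSYNC)) != 0;
}

/* rename is synchronous if either the source or dest directory is dirsync. */
static bool demo_rename_is_sync(const struct demo_inode *src_dir,
                                const struct demo_inode *dst_dir)
{
    return demo_is_dirsync(src_dir) || demo_is_dirsync(dst_dir);
}
```

      A directory update function in this model would call
      demo_is_dirsync() where it previously called demo_is_sync().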
      bb772c58
    • Andrew Morton's avatar
      [PATCH] rename writeback_mapping to writepages · 7d608fac
      Andrew Morton authored
      Spot the difference:
      
      aops.readpage
      aops.readpages
      aops.writepage
      aops.writeback_mapping
      
      The patch renames `writeback_mapping' to `writepages'
      7d608fac
    • Andrew Morton's avatar
      [PATCH] direct-to-BIO writeback · ab9e8941
      Andrew Morton authored
      Multipage BIO writeout from the pagecache.
      
      It's pretty much the same as multipage reads.  It falls back to buffers
      if things got complex.
      
      The write case is a little more complex because it handles pages which
      have buffers and pages which do not.  If the page didn't have buffers
      this code does not add them.
      ab9e8941