1. 30 Apr, 2002 18 commits
    • Andrew Morton's avatar
      [PATCH] cleanup sync_buffers() · b960fa03
      Andrew Morton authored
      Renames sync_buffers() to sync_blockdev() and removes its (never used)
      second argument.
      
      Removes fsync_no_super() in favour of direct calls to sync_blockdev().
      b960fa03
    • Andrew Morton's avatar
      [PATCH] page writeback locking update · a2bcb3a0
      Andrew Morton authored
      - Fixes a performance problem - callers of
        prepare_write/commit_write, etc are locking pages, which synchronises
        them behind writeback, which also locks these pages.  Significant
        slowdowns for some workloads.
      
      - So pages are no longer locked while under writeout.  Introduce a
        new PG_writeback and associated infrastructure to support this design
        change.
      
      - Pages which are under read I/O still use PageLocked.  Pages which
        are under write I/O have PageWriteback() true.
      
        I considered creating Page_IO instead of PageWriteback, and marking
        both readin and writeout pages as PageIO().  So pages are unlocked
        during both read and write.  There just doesn't seem a need to do
        this - nobody ever needs unblocking access to a page which is under
        read I/O.
      
      - Pages under swapout (brw_page) are PageLocked, not PageWriteback.
        So their treatment is unchangeded.
      
        It's not obvious that pages which are under swapout actually need
        the more asynchronous behaviour of PageWriteback.
      
        I was setting the swapout pages PageWriteback and unlocking them
        prior to submitting the buffers in brw_page().  This led to deadlocks
        on the exit_mmap->zap_page_range->free_swap_and_cache path.  These
        functions call block_flushpage under spinlock.  If the page is
        unlocked but has locked buffers, block_flushpage->discard_buffer()
        sleeps.  Under spinlock.  So that will need fixing if for some reason
        we want swapout to use PageWriteback.
      
        Kernel has called block_flushpage() under spinlock for a long time.
         It is assuming that a locked page will never have locked buffers.
        This appears to be true, but it's ugly.
      
      - Adds new function wait_on_page_writeback().  Renames wait_on_page()
        to wait_on_page_locked() to remind people that they need to call the
        appropriate one.
      
      - Renames filemap_fdatasync() to filemap_fdatawrite().  It's more
        accurate - "sync" implies, if anything, writeout and wait.  (fsync,
        msync) Or writeout.  it's not clear.
      
      - Subtly changes the filemap_fdatawrite() internals - this function
        used to do a lock_page() - it waited for any other user of the page
        to let go before submitting new I/O against a page.  It has been
        changed to simply skip over any pages which are currently under
        writeback.
      
        This is the right thing to do for memory-cleansing reasons.
      
        But it's the wrong thing to do for data consistency operations (eg,
        fsync()).  For those operations we must ensure that all data which
        was dirty *at the time of the system call* are tight on disk before
        the call returns.
      
        So all places which care about this have been converted to do:
      
      	filemap_fdatawait(mapping);	/* Wait for current writeback */
      	filemap_fdatawrite(mapping);	/* Write all dirty pages */
      	filemap_fdatawait(mapping);	/* Wait for I/O to complete */
      
      - Fixes a truncate_inode_pages problem - truncate currently will
        block when it hits a locked page, so it ends up getting into lockstep
        behind writeback and all of the file is pointlessly written back.
      
        One fix for this is for truncate to simply walk the page list in the
        opposite direction from writeback.
      
        I chose to use a separate cleansing pass.  It is more
        CPU-intensive, but it is surer and clearer.  This is because there is
        no reason why the per-address_space ->vm_writeback and
        ->writeback_mapping functions *have* to perform writeout in
        ->dirty_pages order.  They may choose to do something totally
        different.
      
        (set_page_dirty() is an a_op now, so address_spaces could almost
        privatise the whole dirty-page handling thing.  Except
        truncate_inode_pages and invalidate_inode_pages assume that the pages
        are on the address_space lists.  hmm.  So making truncate_inode_pages
        and invalidate_inode_pages a_ops would make some sense).
      a2bcb3a0
    • Andrew Morton's avatar
      [PATCH] hashed b_wait · f15fe424
      Andrew Morton authored
      Implements hashed waitqueues for buffer_heads.  Drops twelve bytes from
      struct buffer_head.
      f15fe424
    • Andrew Morton's avatar
      [PATCH] cleanup of bh->flags · 39e8cdf7
      Andrew Morton authored
      Moves all buffer_head-related stuff out of linux/fs.h and into
      linux/buffer_head.h.  buffer_head.h is currently included at the very
      end of fs.h.  So it is possible to include buffer_head directly from
      all .c files and remove this nested include.
      
      Also rationalises all the set_buffer_foo() and mark_buffer_bar()
      functions.  We have:
      
      	set_buffer_foo(bh)
      	clear_buffer_foo(bh)
      	buffer_foo(bh)
      
      and, in some cases, where needed:
      
      	test_set_buffer_foo(bh)
      	test_clear_buffer_foo(bh)
      
      And that's it.
      
      BUFFER_FNS() and TAS_BUFFER_FNS() macros generate all the above real
      inline functions.  Normally not a big fan of cpp abuse, but in this
      case it fits.  These function-generating macros are available to
      filesystems to expand their own b_state functions.  JBD uses this in
      one case.
      39e8cdf7
    • Andrew Morton's avatar
      [PATCH] remove show_buffers() · 411973b4
      Andrew Morton authored
      Remove show_buffers().  It really has nothing to show any more.  just
      buffermem_pages() - move that out into the callers.
      
      There's a lot of duplication in this code.  better approach would be to
      remove all the duplicated code out in the architectures and implement
      generic show_memory_state().  Later.
      411973b4
    • Andrew Morton's avatar
      [PATCH] remove PG_skip · 8dcf47bd
      Andrew Morton authored
      Remove PG_skip.  Nothing is using it (the change was acked by rmk a
      while back)
      8dcf47bd
    • Andrew Morton's avatar
      [PATCH] remove i_dirty_data_buffers · 7d513234
      Andrew Morton authored
      Removes inode.i_dirty_data_buffers.  It's no longer used - all dirty
      buffers have their pages marked dirty and filemap_fdatasync() /
      filemap_fdatawait() catches it all.
      
      Updates all callers.
      
      This required a change in JFS - it has "metapages" which
      are a container around a page which holds metadata.  They
      were holding these pages locked and were relying on fsync_inode_data_buffers
      for writing them out.  So fdatasync() deadlocked.
      
      I've changed JFS to not lock those pages.  Change was acked
      by Dave Kleikamp <shaggy@austin.ibm.com> as the right
      thing to do, but may not be complete.  Probably igrab()
      against ->host is needed to pin the address_space down.
      7d513234
    • Andrew Morton's avatar
      [PATCH] remove buffer head b_inode · df6867ef
      Andrew Morton authored
      Removal of buffer_head.b_inode.  The list_emptiness of b_inode_buffers
      is used to indicate whether the buffer is on an inode's
      i_dirty_buffers.
      df6867ef
    • Andrew Morton's avatar
      [PATCH] cleanup write_one_page · 59c39e8e
      Andrew Morton authored
      Remove writeout_one_page(), waitfor_one_page() and the now-unused
      generic_buffer_fdatasync().
      
      Add new
      
      	write_one_page(struct page *page, int wait)
      
      which is exported to modules.  Update callers to use that.  It's only
      used for IS_SYNC operations.
      59c39e8e
    • Andrew Morton's avatar
      [PATCH] cleanup page_swap_cache · cc6a392a
      Andrew Morton authored
      Removes some redundant BUG checks - trueness of PageSwapCache() implies
      that page->mapping is non-NULL, and we've already checked that.
      cc6a392a
    • Andrew Morton's avatar
      [PATCH] cleanup page flags · aa78091f
      Andrew Morton authored
      page->flags cleanup.
      
      Moves the definitions of the page->flags bits and all the PageFoo
      macros into linux/page-flags.h.  That file is currently included from
      mm.h, but the stage is set to remove that and include page-flags.h
      direct in all .c files which require that.  (120 of them).
      
      The patch also makes all the page flag macros and functions consistent:
      
      For PG_foo, the following functions are defined:
      
      	SetPageFoo
      	ClearPageFoo
      	TestSetPageFoo
      	TestClearPageFoo
      	PageFoo
      
      and that's it.
      
      - Page_Uptodate is renamed to PageUptodate
      
      - LockPage is removed.  All users updated to use SetPageLocked
      
      - UnlockPage is removed.  All callers updated to use unlock_page().
        it's a real function - there's no need to hide that fact.
      
      - PageTestandClearReferenced renamed to TestClearPageReferenced
      
      - PageSetSlab renamed to SetPageSlab
      
      - __SetPageReserved is removed.  It's an infinitesimally small
         microoptimisation, and is inconsistent.
      
      - TryLockPage is renamed to TestSetPageLocked
      
      - PageSwapCache() is renamed to page_swap_cache(), so it doesn't
        pretend to be a page->flags bit test.
      aa78091f
    • Andrew Morton's avatar
      [PATCH] minix directory handling · 68872e78
      Andrew Morton authored
      Convert minixfs directory code to not rely on the state of data outside
      i_size.
      68872e78
    • Andrew Morton's avatar
      [PATCH] remove buffer unused_list · 4beda7c1
      Andrew Morton authored
      Removes the buffer_head unused list.  Use a mempool instead.
      
      The reduced lock contention provided about a 10% boost on ANton's
      12-way.
      4beda7c1
    • Andrew Morton's avatar
      [PATCH] writeback from address spaces · 090da372
      Andrew Morton authored
      [ I reversed the order in which writeback walks the superblock's
        dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
        such a sleaze ]
      
      The core writeback patch.  Switches file writeback from the dirty
      buffer LRU over to address_space.dirty_pages.
      
      - The buffer LRU is removed
      
      - The buffer hash is removed (uses blockdev pagecache lookups)
      
      - The bdflush and kupdate functions are implemented against
        address_spaces, via pdflush.
      
      - The relationship between pages and buffers is changed.
      
        - If a page has dirty buffers, it is marked dirty
        - If a page is marked dirty, it *may* have dirty buffers.
        - A dirty page may be "partially dirty".  block_write_full_page
          discovers this.
      
      - A bunch of consistency checks of the form
      
      	if (!something_which_should_be_true())
      		buffer_error();
      
        have been introduced.  These fog the code up but are important for
        ensuring that the new buffer/page code is working correctly.
      
      - New locking (inode.i_bufferlist_lock) is introduced for exclusion
        from try_to_free_buffers().  This is needed because set_page_dirty
        is called under spinlock, so it cannot lock the page.  But it
        needs access to page->buffers to set them all dirty.
      
        i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
      
      - fs/inode.c has been split: all the code related to file data writeback
        has been moved into fs/fs-writeback.c
      
      - Code related to file data writeback at the address_space level is in
        the new mm/page-writeback.c
      
      - try_to_free_buffers() is now non-blocking
      
      - Switches vmscan.c over to understand that all pages with dirty data
        are now marked dirty.
      
      - Introduces a new a_op for VM writeback:
      
      	->vm_writeback(struct page *page, int *nr_to_write)
      
        this is a bit half-baked at present.  The intent is that the address_space
        is given the opportunity to perform clustered writeback.  To allow it to
        opportunistically write out disk-contiguous dirty data which may be in other zones.
        To allow delayed-allocate filesystems to get good disk layout.
      
      - Added address_space.io_pages.  Pages which are being prepared for
        writeback.  This is here for two reasons:
      
        1: It will be needed later, when BIOs are assembled direct
           against pagecache, bypassing the buffer layer.  It avoids a
           deadlock which would occur if someone moved the page back onto the
           dirty_pages list after it was added to the BIO, but before it was
           submitted.  (hmm.  This may not be a problem with PG_writeback logic).
      
        2: Avoids a livelock which would occur if some other thread is continually
           redirtying pages.
      
      - There are two known performance problems in this code:
      
        1: Pages which are locked for writeback cause undesirable
           blocking when they are being overwritten.  A patch which leaves
           pages unlocked during writeback comes later in the series.
      
        2: While inodes are under writeback, they are locked.  This
           causes namespace lookups against the file to get unnecessarily
           blocked in wait_on_inode().  This is a fairly minor problem.
      
           I don't have a fix for this at present - I'll fix this when I
           attach dirty address_spaces direct to super_blocks.
      
      - The patch vastly increases the amount of dirty data which the
        kernel permits highmem machines to maintain.  This is because the
        balancing decisions are made against the amount of memory in the
        machine, not against the amount of buffercache-allocatable memory.
      
        This may be very wrong, although it works fine for me (2.5 gigs).
      
        We can trivially go back to the old-style throttling with
        s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
        balance_dirty_pages().  But better would be to allow blockdev
        mappings to use highmem (I'm thinking about this one, slowly).  And
        to move writer-throttling and writeback decisions into the VM (modulo
        the file-overwriting problem).
      
      - Drops 24 bytes from struct buffer_head.  More to come.
      
      - There's some gunk like super_block.flags:MS_FLUSHING which needs to
        be killed.  Need a better way of providing collision avoidance
        between pdflush threads, to prevent more than one pdflush thread
        working a disk at the same time.
      
        The correct way to do that is to put a flag in the request queue to
        say "there's a pdlfush thread working this disk".  This is easy to
        do: just generalise the "ra_pages" pointer to point at a struct which
        includes ra_pages and the new collision-avoidance flag.
      090da372
    • Andrew Morton's avatar
      [PATCH] readahead fix · 00d6555e
      Andrew Morton authored
      Changes the way in which the readahead code locates the readahead
      setting for the underlying device.
      
      - struct block_device and struct address_space gain a *pointer* to the
        current readahead tunable.
      
      - The tunable lives in the request queue and is altered with the
        traditional ioctl.
      
      - The value gets *copied* into the struct file at open() time.  So a
        fcntl() mode to modify it per-fd is simple.
      
      - Filesystems which are not request_queue-backed get the address of the
        global `default_ra_pages'.  If we want, this can become a tunable.
      
      - Filesystems are at liberty to alter address_space.ra_pages to point
        at some other fs-private default at new_inode/read_inode/alloc_inode
        time.
      
      - The ra_pages pointer can become a structure pointer if, at some time
        in the future, high-level code needs more detailed information about
        device characteristics.
      
        In fact, it'll need to become a struct pointer for use by
        writeback: my current writeback code has the problem that multiple
        pdflush threads can get stuck on the same request queue.  That's a
        waste of resources.  I currently have a silly flag in the superblock
        to try to avoid this.
      
        The proper way to get this exclusion is for the high-level
        writeback code to be able to do a test-and-set against a
        per-request_queue flag.  That flag can live in a structure alongside
        ra_pages, conveniently accessible at the pagemap level.
      
      One thing still to-be-done is going into all callers of blk_init_queue
      and blk_queue_make_request and making sure that they're setting up a
      sensible default.  ATA wants 248 sectors, and floppy drives don't want
      128kbytes, I suspect.  Later.
      00d6555e
    • Andrew Morton's avatar
      [PATCH] page accounting · d878155c
      Andrew Morton authored
      This patch provides global accounting of locked and dirty pages.  It
      does this via lightweight per-CPU data structures.  The page_cache_size
      accounting has been changed to use this facility as well.
      
      Locked and dirty page accounting is needed for making writeback and
      throttling decisions.
      
      The patch also starts to move code which is related to page->flags
      out of linux/mm.h and into linux/page-flags.h
      d878155c
    • Andrew Morton's avatar
      [PATCH] ext2 directory handling · aa4f3f28
      Andrew Morton authored
      Convert ext2 directory handling to not rely on the contents of pages
      outside i_size.
      
      This is because block_write_full_page (which is used for all writeback)
      zaps the page outside i_size.
      aa4f3f28
    • Andrew Morton's avatar
      [PATCH] page_alloc failure printk · 63b060c4
      Andrew Morton authored
      Emit a printk when a page allocation fails.  Considered useful for
      diagnosing crashes.
      63b060c4
  2. 29 Apr, 2002 2 commits
    • Alexander Viro's avatar
      [PATCH] Re: 2.5.11 breakage · 85d217f4
      Alexander Viro authored
      	OK, here comes.  Patch below is an attempt to do the fastwalk
      stuff in right way and so far it seems to be working.
      
       - dentry leak is plugged
       - locked/unlocked state of nameidata doesn't depend on history - it
         depends only on point in code.
       - LOOKUP_LOCKED is gone.
       - following mounts and .. doesn't drop dcache_lock
       - light-weight permission check distinguishes between "don't know" and
         "permission denied", so we don't call full-blown permission() unless
         we have to.
       - code that changes root/pwd holds dcache_lock _and_ write lock on
         current->fs->lock.  I.e. if we hold dcache_lock we can safely
         access our ->fs->{root,pwd}{,mnt}
       - __d_lookup() does not increment refcount; callers do dget_locked()
         if they need it (behaviour of d_lookup() didn't change, obviously).
       - link_path_walk() logics had been (somewhat) cleaned up.
      85d217f4
    • Martin Dalecki's avatar
      [PATCH] 2.5.10 IDE 45 · 7ca32047
      Martin Dalecki authored
      - Fix bogus set_multimode() change. I tough I had reverted it before diff-ing.
         This was causing hangs of /dev/hdparm -m8 /dev/hda and similar commands.
      7ca32047
  3. 28 Apr, 2002 20 commits