1. 21 Jul, 2002 17 commits
    • [PATCH] fix for nfs_unlink and vfs_unlink · 17191a2c
      Alexander Viro authored
      Ugh.  nfs_unlink() is actually racy as hell - look what happens if
      we enter it with ->d_count == 1, see that nfs_sillyrename() doesn't
      need to do anything and call nfs_safe_remove().  In the meanwhile
      somebody does dcache lookup (without going into NFS code - plain
      and simple cache hit) and increments ->d_count.  nfs_safe_remove()
      decides that something is very rotten and fails.
      
      AFAICS we should take the test for ->d_count + d_drop if it's 1
      under dcache_lock (and in one place).  Comments?
      
      Proposed fix follows:
      	* dget/dput killed in vfs_unlink() (safely)
      	* nfs_unlink() starts with check for ->d_count (under dcache_lock)
      	* if it's > 1 - sillyrename
      	* if it is 1 - immediately unhash, then drop dcache_lock.
      		after that we do as in old variant, except that
      		rehashing is done in nfs_unlink() and only if there
      		was an error - if there was none we simply leave
      		d_delete() to vfs_unlink().
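
      The ordering proposed above can be pictured with a small userspace
      sketch.  This is not the NFS code: toy_dentry, the toy_* functions and
      the pthread mutex standing in for dcache_lock are hypothetical names
      used only for illustration.  The one point it tries to show is that the
      count test and the unhash sit under the same lock, so a cache hit
      cannot take a new reference between the two steps.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
struct toy_dentry {
    int  count;   /* plays the role of ->d_count */
    bool hashed;  /* true while a cache lookup can still find the entry */
};

static pthread_mutex_t toy_dcache_lock = PTHREAD_MUTEX_INITIALIZER;

enum toy_action { TOY_SILLYRENAME, TOY_REMOVE };

/* Decide what unlink should do.  The count test and the unhash happen
 * under one lock, so a lookup cannot slip in between them. */
enum toy_action toy_unlink_prepare(struct toy_dentry *d)
{
    enum toy_action action;

    pthread_mutex_lock(&toy_dcache_lock);
    assert(d->count >= 1);          /* the caller holds a reference */
    if (d->count > 1) {
        action = TOY_SILLYRENAME;   /* somebody else still uses it */
    } else {
        d->hashed = false;          /* unhash before dropping the lock */
        action = TOY_REMOVE;
    }
    pthread_mutex_unlock(&toy_dcache_lock);
    return action;
}

/* A cache hit only succeeds while the entry is still hashed. */
bool toy_lookup(struct toy_dentry *d)
{
    bool found;

    pthread_mutex_lock(&toy_dcache_lock);
    found = d->hashed;
    if (found)
        d->count++;
    pthread_mutex_unlock(&toy_dcache_lock);
    return found;
}

/* Rehash only if the removal failed; on success the entry stays
 * unhashed and is left for the caller to delete. */
void toy_unlink_finish(struct toy_dentry *d, int error)
{
    if (error) {
        pthread_mutex_lock(&toy_dcache_lock);
        d->hashed = true;
        pthread_mutex_unlock(&toy_dcache_lock);
    }
}
```

      With a count of 1 the entry is unhashed at once and later lookups
      miss; with a count above 1 the entry stays hashed and the caller falls
      back to sillyrename.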
    • [PATCH] Re: "big IRQ lock" removal docs · 30034e56
      Robert Love authored
      One more doc correction while we are at it...
    • Linus Torvalds's avatar
      ee816b81
    • [PATCH] "big IRQ lock" removal docs · 6fdf2906
      Ingo Molnar authored
      i've done a minor comment update in softirq.c, plus i've written a
      cli-sti-removal.txt guide to help driver writers do the transition.
    • [PATCH] Serial driver stuff · 33c0d1b0
      Russell King authored
      The serial layer is restructured to allow less code duplication (and
      hence bug duplication) across various serial drivers.  Since ARM adds
      six extra serial drivers, maintaining six copies of serial.c was not
      my idea of fun.
      
      Therefore, we've ended up with a core serial driver, which knows about
      the interactions with the tty layer, and low-level hardware drivers,
      which know all about the hardware.  The interface between the two is
      described in "Documentation/serial/driver".
      
      This patch completely removes the old serial.c driver and its associated
      configuration options, as you requested at KS2002.  We keep a certain
      amount of configuration compatibility with the per-architecture serial.h
      file for the moment; this *will* be killed in the next round of patches.
      The biggest user of this is x86, and since I don't have an x86 box to
      test this stuff on, I think the changes are best kept separate.
    • [PATCH] "big IRQ lock" removal, IRQ cleanups · ae86a80a
      Ingo Molnar authored
      This is a massive cleanup of the IRQ subsystem.  It's loosely based on
      Linus' original idea and DaveM's original implementation, to fold our
      various irq, softirq and bh counters into the preemption counter.
      
      with this approach it was possible:
      
       - to remove the 'big IRQ lock' on SMP - on which sti() and cli() relied.
      
       - to streamline/simplify arch/i386/kernel/irq.c significantly.
      
       - to simplify the softirq code.
      
       - to remove the preemption count increase/decrease code from the lowlevel
         IRQ assembly code.
      
       - to speed up schedule() a bit.
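
      The counter folding mentioned at the top of this message can be
      pictured with a userspace sketch.  The bit layout, the field widths
      and every toy_* name below are assumptions made for illustration and
      are not taken from the patch.  The sketch only shows why one word is
      convenient: "may we preempt?" becomes a single comparison with zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit layout, chosen only for this sketch. */
#define TOY_PREEMPT_SHIFT  0    /* bits 0-7:   preemption-disable depth */
#define TOY_SOFTIRQ_SHIFT  8    /* bits 8-15:  softirq nesting depth */
#define TOY_HARDIRQ_SHIFT  16   /* bits 16-23: hardirq nesting depth */
#define TOY_FIELD_MASK     0xffu

struct toy_task {
    unsigned int count;  /* one word instead of several separate counters */
};

/* Extract one nesting depth from the combined word. */
static unsigned int toy_field(const struct toy_task *t, int shift)
{
    return (t->count >> shift) & TOY_FIELD_MASK;
}

static void toy_enter(struct toy_task *t, int shift)
{
    assert(toy_field(t, shift) < TOY_FIELD_MASK);  /* no overflow into the next field */
    t->count += 1u << shift;
}

static void toy_exit(struct toy_task *t, int shift)
{
    assert(toy_field(t, shift) > 0);               /* exits must pair with enters */
    t->count -= 1u << shift;
}

void toy_preempt_disable(struct toy_task *t) { toy_enter(t, TOY_PREEMPT_SHIFT); }
void toy_preempt_enable(struct toy_task *t)  { toy_exit(t, TOY_PREEMPT_SHIFT); }
void toy_softirq_enter(struct toy_task *t)   { toy_enter(t, TOY_SOFTIRQ_SHIFT); }
void toy_softirq_exit(struct toy_task *t)    { toy_exit(t, TOY_SOFTIRQ_SHIFT); }
void toy_irq_enter(struct toy_task *t)       { toy_enter(t, TOY_HARDIRQ_SHIFT); }
void toy_irq_exit(struct toy_task *t)        { toy_exit(t, TOY_HARDIRQ_SHIFT); }

bool toy_in_irq(const struct toy_task *t)
{
    return toy_field(t, TOY_HARDIRQ_SHIFT) != 0;
}

bool toy_in_interrupt(const struct toy_task *t)
{
    return toy_in_irq(t) || toy_field(t, TOY_SOFTIRQ_SHIFT) != 0;
}

/* A single test covers preemption, softirq and hardirq state at once. */
bool toy_preemptible(const struct toy_task *t)
{
    return t->count == 0;
}
```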
      
      Global sti() and cli() are gone forever on SMP, there is no more globally
      synchronizing irq-disabling capability.  All code that relied on sti()
      and cli() and restore_flags() must use other locking mechanisms from now
      on (spinlocks and __cli()/__sti()).
      
      obviously this patch breaks massive amounts of code, so only limited
      .configs are working at the moment (UP is expected to be unaffected, but
      SMP will require various driver updates).
      
      The patch was developed and tested on SMP systems, and while the code is
      still a bit rough in places, the base IRQ code appears to be pretty
      robust and clean.
      
      while it boots already so the worst is over, there is lots of work left:
      eg. to fix the serial layer to not use cli()/sti() and bhs ...
    • [PATCH] jffs kdev_t cleanups · 3d37e1e6
      Alexander Viro authored
      In the /proc/fs/jffs/* code we switch to passing number of mtd device
      (as an integer) instead of messing with kdev_t (which would always be
      mk_kdev(MTD_BLOCK_MAJOR, device_number) anyway).
    • [PATCH] removal of dead prototypes · b1cd8879
      Alexander Viro authored
      Removed prototypes of several functions that do not exist...
    • [PATCH] SCSI ->bios_param() switched to struct block_device * · 25e560ad
      Alexander Viro authored
      ->bios_param() switched from kdev_t to struct block_device *.
      
      Caller and all instances updated.
    • [PATCH] paride cleanup and fixes · b92b31a3
      Alexander Viro authored
      somewhat related to the above - drivers/block/paride/* switched to
      module_init()/module_exit(), pd.c taught to use LBA if disks support
      it (needed for paride disks >8Gb; change is fairly trivial and I've
      got 40Gb one ;-)
    • [PATCH] blk_ioctl() not exported anymore · 9d16ed71
      Alexander Viro authored
      blk_ioctl() not exported anymore; calls moved from drivers to block_dev.c.
    • [PATCH] partition handling locking cleanups · 81d4c00c
      Alexander Viro authored
      Horrors with open/reread_partition exclusion are starting to get fixed.
      
      It's not the final variant, but at least we are getting the logic into
      one place; switch to final variant will happen once we get per-disk
      analog of gendisks.  New fields - ->bd_part_sem and ->bd_part_count.
      
      The latter counts the number of open partitions.  The former protects
      said count _and_ is held while we are rereading partition tables.
      Helpers - dev_part_lock()/dev_part_unlock() (currently taking kdev_t; that
      will change pretty soon).  No more ->open() and ->release() for partitions,
      all that logic went to generic code.  Lock hierarchy is currently messy:
      
        ->bd_sem for partitions -> ->bd_part_sem -> ->bd_sem for entire disks
      
      Ugly, but that'll go away and to get the final variant of locking right
      now would take _really_ big patch - with a lot of steps glued together.
      The damn thing is large as it is...
    • [PATCH] block device size cleanups · 5844ac33
      Alexander Viro authored
      for partitioned devices we use ->nr_sect to find the size; blk_size[] is
      still used for things like floppy.c, etc.; that will go away later.
      
      There was only one place (do_open()) that needed it - the rest uses
      ->bd_inode->i_size now.  So blkdev_size_in_bytes() is gone - it's
      expanded in its only caller.  Same place (do_open()) finds the partition
      offset and stores it in new field ->bd_offset.  As the result, call of
      get_gendisk() is gone from the IO path - in blk_partition_remap() we
      just add ->bd_offset.
      
      Additionally, we take driver probing (get_blkfops()) outside of ->bd_sem
      (again, do_open()) - that will allow us to kill the ad-hackery in check_partitions()
      (opening bdev by hand).
    • [PATCH] partition parsing cleanup · a22f8253
      Alexander Viro authored
      struct gendisk and partition parsers divorced; all these parsers (IBM style,
      disklabel, etc.) just fill the structure they get from check_partitions().
      
      Actually setting things up (filling hd_struct arrays, telling RAID that
      we had found partitions worth a look, etc.) is taken into check_partitions()
      and done only when we are done with parsing.  Parsers don't know (or care)
      what majors/minors they are dealing with; that knowledge also went to
      check_partitions().
    • [PATCH] Use wipe_partitions() where appropriate · d6d4f980
      Alexander Viro authored
      a bunch of places doing invalidate_device() either didn't need it at all
      or actually wanted wipe_partitions().  Switched.
    • [PATCH] make hfs use regular semaphores · a4dea1b6
      Alexander Viro authored
      unrelated to the rest, replaces home-grown (racy) semaphores in fs/hfs
      with the real thing.
    • Fix incoherent LDT at mmap exit. · f6daaf1a
      Linus Torvalds authored
      We should _not_ update the current LDT if it's not the current
      MM that we are tearing down.
  2. 20 Jul, 2002 3 commits
  3. 19 Jul, 2002 20 commits
    • LSM: for now, always set CONFIG_SECURITY_CAPABILITIES to y · 7a19fd4a
      Greg Kroah-Hartman authored
      This can be overridden by editing the .config file if you really want it.
    • Merge bk://lsm.bkbits.net/linus-2.5 · 3bfd74ba
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • Greg Kroah-Hartman's avatar
    • LSM: Add all of the new security/* files for basic task control · 2b15fe63
      Greg Kroah-Hartman authored
      This includes the security_* functions, and the default and capability
      modules.
    • LSM: change BUS_ISA to CTL_BUS_ISA to prevent namespace collision with the input subsystem. · c59ccd5f
      Greg Kroah-Hartman authored
      This is needed due to the next header file changes.
    • [PATCH] Add 4G-1 file support to FAT32 · d4db5063
      Hirofumi Ogawa authored
      This patch changes cont_prepare_write(), in order to support a 4G-1
      file for FAT32.
      
       int cont_prepare_write(struct page *page, unsigned offset,
      -		unsigned to, get_block_t *get_block, unsigned long *bytes)
      +		unsigned to, get_block_t *get_block, loff_t *bytes)
      
      And it fixes broken adfs/affs/fat/hfs/hpfs/qnx4 by this
      cont_prepare_write() change.
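
      One way a 32-bit byte counter can fall short near 4G-1 is shown in
      the sketch below.  It is an illustration under assumptions: uint32_t
      stands in for a 32-bit unsigned long, int64_t for loff_t, and the
      block size and function names are hypothetical.  Whether this exact
      wraparound is the failure the patch fixes is not stated in the
      message.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 4096u  /* hypothetical block size */

/* Models a byte counter held in a 32-bit 'unsigned long':
 * the addition wraps modulo 2^32 before the mask is applied. */
uint32_t toy_round_up_32(uint32_t bytes)
{
    return (bytes + TOY_BLOCK_SIZE - 1) & ~(uint32_t)(TOY_BLOCK_SIZE - 1);
}

/* Models the same counter held in a 64-bit loff_t. */
int64_t toy_round_up_64(int64_t bytes)
{
    assert(bytes >= 0);
    return (bytes + TOY_BLOCK_SIZE - 1) & ~(int64_t)(TOY_BLOCK_SIZE - 1);
}
```

      For a size of 4G-1 (0xFFFFFFFF) the 32-bit version wraps to 0, while
      the 64-bit version yields 0x100000000.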
    • Merge http://linuxusb.bkbits.net/linus-2.5 · 047cef32
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux
    • [PATCH] readahead optimisations · b6938a7b
      Andrew Morton authored
      Been looking at a workload which involves several processes which seek
      around and read from a large file.  There are a few problems:
      generic_file_lseek is bouncing i_sem around like mad, and readahead is
      doing lots of pointless pagecache probing.
      
      This patch addresses readahead.
      
      Presumably the change will be larger on machines which have higher
      bandwidth memory than my test box, of which there are many.
      
      This patch teaches readahead to detect the situation where no IO is
      actually being performed as a result of its actions.  Now, we don't
      want to sacrifice IO efficiency to save a bit of CPU, so the code is
      very cautious.  But eventually, after some tens of consecutive
      readahead attempts were found to perform no I/O at all, readahead will
      turn itself off.
      
      readahead will be turned on again when either generic_file_read() or
      filemap_nopage() gets a pagecache miss.  The function
      handle_ra_thrashing() has been renamed to handle_ra_miss() to reflect
      its widened role.
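
      The switch-off and switch-on behaviour described above can be
      sketched as a small state machine.  All names are hypothetical, and
      the limit of 32 is an assumption: the message only says "some tens"
      of consecutive attempts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold; the message only says "some tens". */
#define TOY_RA_IDLE_LIMIT 32

struct toy_ra {
    bool enabled;      /* is readahead currently active? */
    int  idle_streak;  /* consecutive attempts that started no I/O */
};

void toy_ra_init(struct toy_ra *ra)
{
    ra->enabled = true;
    ra->idle_streak = 0;
}

/* Record one readahead attempt.  pages_submitted is the number of
 * pages that actually needed I/O (0 means everything was cached). */
void toy_ra_attempt(struct toy_ra *ra, int pages_submitted)
{
    assert(pages_submitted >= 0);
    if (!ra->enabled)
        return;
    if (pages_submitted > 0) {
        ra->idle_streak = 0;     /* real I/O happened: stay cautious, stay on */
        return;
    }
    if (++ra->idle_streak >= TOY_RA_IDLE_LIMIT)
        ra->enabled = false;     /* long run of pointless probing: turn off */
}

/* A pagecache miss in the read or fault path turns readahead back on. */
void toy_ra_miss(struct toy_ra *ra)
{
    ra->enabled = true;
    ra->idle_streak = 0;
}
```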
      
      A performance bug in page_cache_readround() was fixed - if
      ra->next_size is zero, that function needs to leave it well alone,
      because next_size==0 is a magic value meaning that the file has just
      been opened and that readahead needs to get aggressive.  This change
      makes a `make dep' run at the same speed as in the 2.4 kernel.  It used
      to take 4x as long...
      
      `make dep' is an interesting test because it uses mmap to read the files.
    • [PATCH] writeback scalability improvements · e64fa3db
      Andrew Morton authored
      The kernel has a number of problems wrt heavy write traffic to multiple
      spindles.  What keeps on happening is that all processes which are
      responsible for writeback get blocked on one of the queues and all the
      others fall idle.
      
      This happens in the balance_dirty_pages() path (balance_dirty() in 2.4)
      and in the page reclaim code, when a dirty page is found on the LRU.
      
      The latter is particularly bad because it causes "innocent" processes
      to be suspended for long periods due to the activity of heavy writers.
      
      The general idea is: the primary resource for writeback should be the
      process which is dirtying memory.  The secondary resource is the
      pdflush pool (although this is mainly for providing async writeback in
      the presence of light-moderate loads).  And the final
      oh-gee-we-screwed-up resource for writeback is a caller to
      shrink_cache().
      
      This patch addresses the balance_dirty_pages() path.  This code was
      initially modelled on the 2.4 writeback scheme: throttled processes
      writeback all data regardless of its queue.  Instead, the patch changes
      it so that the balance_dirty_pages() caller only writes back pages
      which are dirty against the queue which that caller just dirtied.
      
      So the effect is a better allocation of writeback resources across the
      queues and increased parallelism.
      
      The per-queue writeback is implemented by using
      mapping->backing_dev_info as a search key during the walk across the
      superblocks and inodes.
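
      The search-key idea can be sketched like this.  The types and the
      flat array of inodes are hypothetical simplifications; the real walk
      goes across superblocks and inodes.  A NULL target, meaning "any
      queue", is an assumption of the sketch that models the older
      write-everything behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one device queue. */
struct toy_bdi {
    int id;
};

struct toy_inode {
    const struct toy_bdi *bdi;  /* queue this inode's pages are dirty against */
    int dirty_pages;
};

/* Write back only the inodes whose queue matches target.
 * target == NULL means any queue.  Returns the pages written. */
int toy_writeback(struct toy_inode *inodes, size_t n,
                  const struct toy_bdi *target)
{
    int written = 0;

    assert(inodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (target != NULL && inodes[i].bdi != target)
            continue;                 /* dirty against some other queue */
        written += inodes[i].dirty_pages;
        inodes[i].dirty_pages = 0;
    }
    return written;
}
```

      A throttled writer that passes its own queue leaves the other
      queues' dirty pages to whoever dirtied them.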
      
      The patch also fixes an initialisation problem in
      block_dev.c:do_open(): it was setting up the blockdev's
      mapping->backing_dev_info too early, before the queue has been
      identified.
      
      Generally, this patch doesn't help much, because of the stalls in the
      page allocator.  I have a patch which mostly fixes that up, and taken
      together the kernel is achieving almost platter speed against six
      spindles, but only when the system has a small amount of memory.  More
      work is needed there.
    • [PATCH] remove add_to_page_cache_unique() · cad46d66
      Andrew Morton authored
      A tasty patch from Hugh Dickins.  radix_tree_insert() fails if something
      was already present at the target index, so that error can be
      propagated back through add_to_page_cache().  Hence
      add_to_page_cache_unique() is obsolete.
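
      The reasoning can be sketched with a toy index.  A fixed array
      stands in for the radix tree, and every name and the choice of
      -EEXIST as the error value are assumptions of this sketch.  Because
      the insert itself refuses an occupied slot, the caller needs no
      separate lookup to guarantee uniqueness.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 64  /* hypothetical fixed-size index */

struct toy_page {
    bool in_cache;
};

struct toy_mapping {
    struct toy_page *slot[TOY_SLOTS];
};

/* Insert fails if something is already present at the index. */
int toy_tree_insert(struct toy_mapping *m, size_t index, struct toy_page *page)
{
    assert(index < TOY_SLOTS);
    if (m->slot[index] != NULL)
        return -EEXIST;
    m->slot[index] = page;
    return 0;
}

/* The insert error is simply propagated; no "unique" variant is needed. */
int toy_add_to_page_cache(struct toy_mapping *m, size_t index,
                          struct toy_page *page)
{
    int error = toy_tree_insert(m, index, page);

    if (error == 0)
        page->in_cache = true;
    return error;
}
```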
      
      Hugh's patch removes add_to_page_cache_unique() and cleans up a bunch of
      stuff.
    • [PATCH] direct_io mopup · e3339bee
      Andrew Morton authored
      Some cleanup from the surprise direct-to-bio for O_DIRECT merge.
      
      - Remove bits and pieces from the kiobuf implementation
      
      - Replace the waitqueue in struct dio with just a task_struct pointer
        and use wake_up_process.  (Ben).
      
      - Only take mmap_sem around the individual calls to get_user_pages().
         (It pins the vmas, yes?)
      
      - Remove some debug code.
      
      - Fix JFS.
    • [PATCH] alloc_pages cleanup · 4504a57e
      Andrew Morton authored
      Cleanup patch from Martin Bligh: convert some loops which want to be
      `for' loops into that, and add some commentary.
    • [PATCH] inline generic_writepages() · 15a37ba2
      Andrew Morton authored
      generic_writepages() is just a wrapper around mpage_writepages(), so
      inline it.
    • [PATCH] restore CHECK_EMERGENCY_SYNC. Again. · 3d4ed856
      Andrew Morton authored
      Put the CHECK_EMERGENCY_SYNC back into the kupdate function.  I seem to
      keep removing it.
    • [PATCH] O_DIRECT open check · 7d0be429
      Andrew Morton authored
      Updated forward-port of Andrea's O_DIRECT open() checks.  If the user
      asked for O_DIRECT and the inode has no mapping or no a_ops then fail
      the open up-front.
    • [PATCH] VM instrumentation · e177ea28
      Andrew Morton authored
      A patch from Rik which adds some operational statistics to the VM.
      
      In /proc/meminfo:
      
      PageTables:	Amount of memory used for process pagetables
      PteChainTot:	Amount of memory allocated for pte_chain objects
      PteChainUsed:	Amount of memory currently in use for pte chains.
      
      In /proc/stat:
      
      pageallocs:	Number of pages allocated in the page allocator
      pagefrees:	Number of pages returned to the page allocator
      
      		(These can be used to measure the allocation rate)
      
      pageactiv:	Number of pages activated (moved to the active list)
      pagedeact:	Number of pages deactivated (moved to the inactive list)
      pagefault:	Total pagefaults
      majorfault:	Major pagefaults
      pagescan:	Number of pages which shrink_cache looked at
      pagesteal:	Number of pages which shrink_cache freed
      pageoutrun:	Number of calls to try_to_free_pages()
      allocstall:	Number of calls to balance_classzone()
      
      
      Rik will be writing a userspace app which interprets these things.
      
      The /proc/meminfo stats are efficient, but the /proc/stat accumulators
      will cause undesirable cacheline bouncing.  We need to break the disk
      statistics out of struct kernel_stat and make everything else in there
      per-cpu.  If that doesn't happen in time for 2.6 then we disable
      KERNEL_STAT_INC().
    • [PATCH] avoid allocating pte_chains for unshared pages · 6a2ea338
      Andrew Morton authored
      Patch from David McCracken.  It is an optimisation to the rmap
      pte_chains.
      
      In the common case where a page is mapped by only a single pte, we
      don't need to allocate a pte_chain structure.  Just make the page's
      pte_chain pointer point straight at that pte and flag this with
      PG_direct.
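
      The optimisation can be sketched in userspace as below.  The
      structures, the boolean standing in for PG_direct and the use of
      malloc() are all assumptions made for illustration.  The first
      mapping costs no allocation; a chain is built only when a second pte
      maps the same page.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_pte { int unused; };

struct toy_pte_chain {
    struct toy_pte *pte;
    struct toy_pte_chain *next;
};

struct toy_page {
    bool direct;                      /* plays the role of PG_direct */
    union {
        struct toy_pte *pte;          /* valid when direct is true */
        struct toy_pte_chain *chain;  /* valid when direct is false */
    } map;
};

void toy_page_init(struct toy_page *page)
{
    page->direct = false;
    page->map.chain = NULL;
}

static struct toy_pte_chain *toy_chain_push(struct toy_pte_chain *head,
                                            struct toy_pte *pte)
{
    struct toy_pte_chain *node = malloc(sizeof(*node));

    if (node == NULL)
        return NULL;
    node->pte = pte;
    node->next = head;
    return node;
}

/* Record that pte maps page.  Returns 0 or -ENOMEM. */
int toy_page_add_rmap(struct toy_page *page, struct toy_pte *pte)
{
    struct toy_pte_chain *head;

    assert(pte != NULL);
    if (!page->direct && page->map.chain == NULL) {
        page->direct = true;          /* common case: no allocation at all */
        page->map.pte = pte;
        return 0;
    }
    if (page->direct) {               /* second mapper: convert to a chain */
        head = toy_chain_push(NULL, page->map.pte);
        if (head == NULL)
            return -ENOMEM;
        page->direct = false;
        page->map.chain = head;
    }
    head = toy_chain_push(page->map.chain, pte);
    if (head == NULL)
        return -ENOMEM;
    page->map.chain = head;
    return 0;
}

int toy_page_mapcount(const struct toy_page *page)
{
    int n = 0;

    if (page->direct)
        return 1;
    for (const struct toy_pte_chain *c = page->map.chain; c != NULL; c = c->next)
        n++;
    return n;
}
```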
    • [PATCH] leave truncate's orphaned pages on the LRU · fa08cc83
      Andrew Morton authored
      Fix to the page reclaim code from Rik.
      
      Anonymous pages which have buffers arise when
      truncate_complete_page()'s call to ->releasepage() failed.  Those pages
      may still be mapped into process address spaces.
      
      We should not remove them from the LRU, because that makes them
      unswappable and they hang around until process exit.
    • [PATCH] minimal rmap · c48c43e6
      Andrew Morton authored
      This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig
      Kulesa.
      
      Basically,
      
      before: When the page reclaim code decides that it has scanned too many
      unreclaimable pages on the LRU it does a scan of process virtual
      address spaces for pages to add to swapcache.  ptes pointing at the
      page are unmapped as the scan proceeds.  When all ptes referring to a
      page have been unmapped and it has been written to swap the page is
      reclaimable.
      
      after: When an anonymous page is encountered on the tail of the LRU we
      use the rmap to see if it hasn't been referenced lately.  If so then
      add it to swapcache.  When the page is again encountered on the LRU, if
      it is still unreferenced then try to unmap all ptes which refer to it
      in one hit, and if it is clean (ie: on swap) then free it.
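
      The "after" behaviour can be sketched as a per-page decision taken
      each time the page reaches the tail of the LRU.  The structure, the
      result names, clearing the referenced bit on a KEEP, and marking a
      freshly swapcached page dirty are all assumptions of this sketch,
      not details stated in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state for an anonymous page. */
struct toy_page {
    int  mapcount;      /* ptes found through the reverse mapping */
    bool referenced;    /* some pte was referenced since the last scan */
    bool in_swapcache;
    bool dirty;         /* not yet written to swap */
};

enum toy_result { TOY_KEEP, TOY_SWAPCACHED, TOY_UNMAPPED, TOY_FREED };

/* One encounter of an anonymous page at the tail of the LRU. */
enum toy_result toy_scan_anon_page(struct toy_page *p)
{
    assert(p->mapcount >= 0);
    if (p->referenced) {
        p->referenced = false;   /* recently used: leave it alone */
        return TOY_KEEP;
    }
    if (!p->in_swapcache) {
        p->in_swapcache = true;  /* first unreferenced encounter */
        p->dirty = true;         /* still has to be written to swap */
        return TOY_SWAPCACHED;
    }
    p->mapcount = 0;             /* unmap every pte in one hit */
    return p->dirty ? TOY_UNMAPPED : TOY_FREED;
}
```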
      
      The rest of the VM - list management, the classzone concept, etc
      remains unchanged.
      
      There are a number of things which the per-page pte chain could be
      used for.  Bill Irwin has identified the following.
      
      
      (1)  page replacement no longer goes around randomly unmapping things
      
      (2)  referenced bits are more accurate because there aren't several ms
              or even seconds between finding the multiple ptes mapping a page
      
      (3)  reduces page replacement from O(total virtually mapped) to O(physical)
      
      (4)  enables defragmentation of physical memory
      
      (5)  enables cooperative offlining of memory for friendly guest instance
              behavior in UML and/or LPAR settings
      
      (6)  demonstrable benefit in performance of swapping which is common in
              end-user interactive workstation workloads (I don't like the word
              "desktop"). c.f. Craig Kulesa's post wrt. swapping performance
      
      (7)  evidence from 2.4-based rmap trees indicates approximate parity
              with mainline in kernel compiles with appropriate locking bits
      
      (8)  partitioning of physical memory can reduce the complexity of page
              replacement searches by scanning only the "interesting" zones
              implemented and merged in 2.4-based rmap
      
      (9)  partitioning of physical memory can increase the parallelism of page
              replacement searches by independently processing different zones
              implemented, but not merged in 2.4-based rmap
      
      (10) the reverse mappings may be used for efficiently keeping pte cache
              attributes coherent
      
      (11) they may be used for virtual cache invalidation (with changes)
      
      (12) the reverse mappings enable proper RSS limit enforcement
              implemented and merged in 2.4-based rmap
      
      
      
      The code adds a pointer to struct page, consumes additional storage for
      the pte chains and adds computational expense to the page reclaim code
      (I measured it at 3% additional load during streaming I/O).  The
      benefits which we get back for all this are, I must say, theoretical
      and unproven.  If it has real advantages (or, indeed, disadvantages)
      then why has nobody demonstrated them?
      
      
      
      There are a number of things remaining to be done:
      
      1: Demonstrate the above advantages.
      
      2: Make it work with pte-highmem  (Bill Irwin is signed up for this)
      
      3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's
         patch does this)
      
      4: Move the pte_chains into highmem too (Bill, I guess)
      
      5: per-cpu pte_chain freelists (Rik?)
      
      6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)
      
      7: multithread the page reclaim code.  (I have patches).
      
      8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
         often well-ordered-by-virtual-address on the LRU, so it "just
         works" for benchmarky loads.  But there may be some other loads...
      
      9: Fix bad IO latency in page reclaim (I have lame patches)
      
      10: Develop tuning tools, use them.
      
      11: The nightly updatedb run is still evicting everything.
    • Merge bk://lsm.bkbits.net/linus-2.5 · b15d45bf
      Linus Torvalds authored
      into home.transmeta.com:/home/torvalds/v2.5/linux