- 29 Jun, 2004 2 commits
Linus Torvalds authored
I don't think we're in K&R any more, Toto. If you want a NULL pointer, use NULL. Don't use an integer. Most of the users really didn't seem to know the proper type.
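A hypothetical before-and-after of the pattern (the helper is invented for illustration; b_private really is a void pointer):

    #include <linux/buffer_head.h>

    /* hypothetical example: b_private is a void *, so say NULL */
    static void clear_bh_private(struct buffer_head *bh)
    {
        bh->b_private = NULL;    /* not "bh->b_private = 0" */
    }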
Anton Altaparmakov authored
I noticed that fs/buffer.c::drop_buffers() contains some code that AFAICS doesn't actually do anything other than waste cpu cycles, so here is a patch to remove it. The local variable was_uptodate is being messed with but it is not being read anywhere, so it seems entirely pointless. I assume this must be a remnant of old code which mucked around with the page uptodateness but which has since been (re-)moved.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- 22 May, 2004 2 commits
Andrew Morton authored
Many places do: if (kmem_cache_create(...) == NULL) panic(...); We can consolidate all that by passing another flag to kmem_cache_create() which says "panic if it doesn't work".
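A sketch of the consolidation; the flag name SLAB_PANIC and the cache are assumed for the example:

    #include <linux/slab.h>

    static kmem_cache_t *foo_cache;    /* cache name invented */

    /* before: every call site open-codes the failure check */
    static void foo_init_old(void)
    {
        foo_cache = kmem_cache_create("foo", sizeof(long), 0, 0,
                        NULL, NULL);
        if (foo_cache == NULL)
            panic("Cannot create foo cache");
    }

    /* after: the allocator itself panics on failure */
    static void foo_init_new(void)
    {
        foo_cache = kmem_cache_create("foo", sizeof(long), 0,
                        SLAB_PANIC, NULL, NULL);
    }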
Andrew Morton authored
Go back to the 2.6.5 concepts, with rmap additions. In particular:
- Implement Andrea's flavour of page_mapping(). This function opaquely does the right thing for pagecache pages, anon pages and for swapcache pages. The critical thing here is that page_mapping() returns &swapper_space for swapcache pages without actually requiring the storage at page->mapping. This frees page->mapping for the anonmm/anonvma metadata.
- Andrea and Hugh placed the pagecache index of swapcache pages into page->private rather than page->index. So add new page_index() function which hides this.
- Make swapper_space.set_page_dirty() again point at __set_page_dirty_buffers(). If we don't do that, a bare set_page_dirty() will fall through to __set_page_dirty_buffers(), which is silly. This way, __set_page_dirty_buffers() can continue to use page->mapping. It should never go near anon or swapcache pages.
- Give swapper_space a ->set_page_dirty address_space_operation method, so that set_page_dirty() will not fall through to __set_page_dirty_buffers() for swapcache pages. That function is not set up to handle them.

The main effect of these changes is that swapcache pages are treated more similarly to pagecache pages. And we are again tagging swapcache pages as dirty in their radix tree, which is a requirement if we later wish to implement swapcache writearound based on tagged radix-tree walks.
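A minimal sketch of the page_mapping() idea (not the exact kernel code):

    /* sketch: swapcache and anon pages no longer rely on page->mapping */
    struct address_space *page_mapping(struct page *page)
    {
        if (PageSwapCache(page))
            return &swapper_space;    /* no storage at page->mapping needed */
        if (PageAnon(page))
            return NULL;              /* ->mapping holds anon rmap metadata */
        return page->mapping;         /* ordinary pagecache */
    }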
- 21 May, 2004 1 commit
Andrew Morton authored
We keep on getting BUG()s from isofs_read_super() because it passes an insane blocksize to bread(). See http://bugme.osdl.org/show_bug.cgi?id=2735 for example. I don't know what's up with isofs, but going BUG in there seems a bit rude. Change it to print a bunch of diagnostics and a backtrace, then return a NULL bh. Most callers of getblk() don't expect it to fail, so they'll oops anyway. But isofs does actually check for a NULL return. This way, the machine stays up and we get better debug diagnostics.
- 19 May, 2004 2 commits
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> The IA64 hotplug CPU merge seems to have included some core changes: in particular the recalc_bh_state() needs to sum for all (including offline) cpus, since we don't empty the counters on CPU down. The totals printed by /proc/stat (the first loop) should include offline cpus, too (apparently printing out the per-cpu lines for offline cpus confuses top).
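The shape of the fix, sketched; the per-cpu counter name bh_accounting is assumed:

    /* sum over every possible CPU, online or not, since the counters
     * are not drained when a CPU goes down */
    static int tally_buffer_heads(void)
    {
        int cpu, tot = 0;

        for_each_cpu(cpu)
            tot += per_cpu(bh_accounting, cpu).nr;
        return tot;
    }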
Andrew Morton authored
blk_run_page() is incorrectly using page->mapping, which makes it racy against removal from swapcache. Make block_sync_page() use page_mapping(), and remove blk_run_page(), which only had one caller.
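A sketch of the corrected helper; the unplug call assumes the backing_dev_info scheme from the 12 Apr series below:

    static void block_sync_page(struct page *page)
    {
        /* page_mapping() is safe against removal from swapcache */
        struct address_space *mapping = page_mapping(page);

        if (mapping)
            blk_run_backing_dev(mapping->backing_dev_info, page);
    }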
- 14 May, 2004 4 commits
Andrew Morton authored
From: Paul Jackson <pj@sgi.com>

With a hotplug capable kernel, there is a requirement to distinguish a possible CPU from one actually present. The set of possible CPU numbers doesn't change during a single system boot, but the set of present CPUs changes as CPUs are physically inserted into or removed from a system. The cpu_possible_map does not change once initialized at boot, but the cpu_present_map changes dynamically as CPUs are inserted or removed.

Paul Jackson <pj@sgi.com> provided an expanded explanation:

Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following cpu maps being available. All the following maps are fixed size bitmaps of size NR_CPUS:

    #ifdef CONFIG_HOTPLUG_CPU
    cpu_possible_map - map with all NR_CPUS bits set
    cpu_present_map  - map with bit 'cpu' set iff cpu is populated
    cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
    #else
    cpu_possible_map - map with bit 'cpu' set iff cpu is populated
    cpu_present_map  - copy of cpu_possible_map
    cpu_online_map   - map with bit 'cpu' set iff cpu available to scheduler
    #endif

In either case, NR_CPUS is fixed at compile time, as the static size of these bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU ids that might ever be plugged in at any time during the life of that system boot. The cpu_present_map is dynamic(*), representing which CPUs are currently plugged in. And cpu_online_map is the dynamic subset of cpu_present_map, indicating those CPUs available for scheduling.

If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS bits set, otherwise it is just the set of CPUs that ACPI reports present at boot. If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on what ACPI reports as currently plugged in, otherwise cpu_present_map is just a copy of cpu_possible_map.

(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug, it's the same as cpu_possible_map, hence fixed at boot.
Andrew Morton authored
We don't trust bh->b_page to point to the right thing across all filesystems, so revert this bit.
Andrew Morton authored
From: Andrea Arcangeli <andrea@suse.de>
From: Jens Axboe

Add the blk_run_page() API. This is so that we can pass the target page all the way down to (for example) the swap unplug function, so swap can work out which blockdevs back this particular page.
Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com>

This patch implements wake-one semantics for buffer_head wakeups in a single step. The buffer_head being waited on is passed to the waiter's wakeup function by the waker; the wakeup function compares that to a pointer stored in its on-stack structure and also checks the readiness of the bit there. Wake-one semantics are achieved by using WQ_FLAG_EXCLUSIVE in the codepaths waiting to acquire the bit for mutual exclusion.
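A condensed sketch of the scheme; the structure and function names are assumed:

    /* each waiter keeps this on its stack */
    struct bh_wait_queue {
        struct buffer_head *bh;
        wait_queue_t wait;
    };

    /* the waker passes the released buffer_head as "key" */
    static int bh_wake_function(wait_queue_t *wait, unsigned mode,
                    int sync, void *key)
    {
        struct buffer_head *bh = key;
        struct bh_wait_queue *q =
            container_of(wait, struct bh_wait_queue, wait);

        if (q->bh != bh || buffer_locked(bh))
            return 0;    /* wrong buffer, or bit still busy */
        return default_wake_function(wait, mode, sync, key);
    }

Waiters queue themselves with WQ_FLAG_EXCLUSIVE, so a successful wakeup stops the walk and only one contender is woken.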
- 22 Apr, 2004 1 commit
Andrew Morton authored
If a filesystem's ->writepage implementation repeatedly refuses to write the page and keeps redirtying it instead (reiserfs seems to do this), then the writeback logic can get stuck repeatedly trying to write the same page. Fix that up by correctly setting wbc->pages_skipped, to tell the writeback logic that things aren't working out.
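A sketch of how the writeback loop can use the counter (the loop shape is assumed):

    /* detect "no progress" and throttle rather than spin on a page
     * the filesystem keeps redirtying */
    static void writeback_pass(struct writeback_control *wbc)
    {
        long skipped = wbc->pages_skipped;

        writeback_inodes(wbc);
        if (wbc->pages_skipped != skipped)
            blk_congestion_wait(WRITE, HZ/10);    /* back off */
    }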
- 21 Apr, 2004 1 commit
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de>

These are the generic lockfs bits. Basically it moves the XFS freezing state machine into the VFS. It's all behind the kernel-doc documented freeze_bdev and thaw_bdev interfaces. Based on an older patch from Chris Mason.
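Typical use, as a sketch; the snapshot step is a placeholder:

    /* quiesce the filesystem on bdev around a device-level snapshot */
    static void snapshot_bdev(struct block_device *bdev)
    {
        struct super_block *sb = freeze_bdev(bdev);  /* sync + block writers */

        /* ... take the snapshot here ... */

        thaw_bdev(bdev, sb);                         /* resume normal service */
    }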
- 17 Apr, 2004 2 commits
Andrew Morton authored
From: Jeff Garzik <jgarzik@pobox.com> It was debug code, no longer required.
Andrew Morton authored
From: Jeff Garzik <jgarzik@pobox.com> Nobody ever checks the return value of submit_bh(), and submit_bh() is the only caller that checks the submit_bio() return value. This changes the kernel I/O submission path -- a fast path -- so this cleanup is also a microoptimization.
- 12 Apr, 2004 10 commits
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com>

Tracking anonymous pages by anon_vma,pgoff or mm,address needs a pointer,offset pair in struct page: mapping,index being the natural choice. But swapcache uses those for &swapper_space,swp_entry_t. It's trivial to separate swapcache from pagecache with the radix tree; most of swapper_space is actually unused, just a fiction to pretend swap is like a file; and page->private is a good place to keep the swp_entry_t, now that swap never uses bufferheads.

Define a PG_anon bit, have page_add_rmap SetPageAnon, and put an oopsable address in page->mapping to test that we're not confused by it. Define a page_mapping(page) macro to give NULL when PageAnon, whatever may be in page->mapping. Define a PG_swapcache bit, and deduce swapper_space from that in the few places we need it. add_to_swap_cache is now distinct from add_to_page_cache. Separating the caches somewhat simplifies the tmpfs swizzling in swap_state.c, since now the page can briefly be in both caches.

The rmap method remains pte chains; no change to that yet. But there is one small functional difference: the use of PageAnon implies that a page truncated while still mapped will no longer be found and freed (swapped out) by try_to_unmap; it will only be freed by exit or munmap. But normally pages are unmapped by vmtruncate: this should only affect nonlinear mappings, and a later patch not in this batch will fix that.
Andrew Morton authored
From: Jens Axboe <axboe@suse.de>, Chris Mason, me, others.

The global unplug list causes horrid spinlock contention on many-disk many-CPU setups - throughput is worse than halved. The other problem with the global unplugging is of course that it will cause the unplugging of queues which are unrelated to the I/O upon which the caller is about to wait.

So what we do to solve these problems is to remove the global unplug and set up the infrastructure under which the VFS can tell the block layer to unplug only those queues which are relevant to the page or buffer_head which is about to be waited upon. We do this via the very appropriate address_space->backing_dev_info structure.

Most of the complexity is in devicemapper, MD and swapper_space, because for these backing devices, multiple queues may need to be unplugged to complete a page/buffer I/O. In each case we ensure that data structures are in place to permit us to identify all the lower-level queues which contribute to the higher-level backing_dev_info. Each contributing queue is told to unplug in response to a higher-level unplug.

To simplify things in various places we also introduce the concept of a "synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will perform an immediate unplug when it sees one of these go past.
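The synchronous-BIO tagging amounts to setting one bit, roughly:

    /* sketch: a tagged bio makes the block layer unplug immediately */
    static void mark_bio_synchronous(struct bio *bio)
    {
        bio->bi_rw |= (1 << BIO_RW_SYNC);
    }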
Andrew Morton authored
From: Chris Mason <mason@suse.com> reiserfs data=ordered support.
Andrew Morton authored
From: Bart Samwel <bart@samwel.tk>

Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop". In this mode the kernel will attempt to avoid spinning disks up.

Algorithm: the idea is to hold dirty data in memory for a long time, but to flush everything which has been accumulated if the disk happens to spin up for other reasons.
- Whenever a disk request completes (read or write), schedule a timer a few seconds hence. If the timer was already pending, reset it to a few seconds hence.
- When the timer expires, write back the whole world. We use sync_filesystems() for this because it will force ext3 journal commits as well.
- In balance_dirty_pages(), kick off background writeback when we hit the high threshold (dirty_ratio), not when we hit the low threshold. This has the effect of causing "lumpy" writeback which is something I spent a year fixing, but in laptop mode, it is desirable.
- In try_to_free_pages(), only kick pdflush if the VM is getting into distress: we want to keep scanning for clean pages, deferring writeback.
- In page reclaim, avoid writing back the odd random dirty page off the LRU: only start I/O if the scanning is working harder.

The effect is to perform a sync() a few seconds after all I/O has ceased. The value which was written into /proc/sys/vm/laptop-mode determines, in seconds, the delay between the final I/O and the flush.

Additionally, the patch adds tools which help answer the question "why the heck does my disk spin up all the time?". The user may set /proc/sys/vm/block_dump to a non-zero value and the kernel will print out information which will identify the process which is performing disk reads or which is dirtying pagecache. The user should probably disable syslogd before setting block-dump.
Andrew Morton authored
If pdflush hits a locked-and-clean buffer in __block_write_full_page() it will just pass over the buffer. Typically the buffer is an ext3 data=ordered buffer which is being written by kjournald, but a similar thing can happen with blockdev buffers and ll_rw_block(). This is bad because the buffer is still under I/O and a subsequent fsync's fdatawait() needs to know about it. It is not practical to tag the page for writeback - only the submitter of the I/O can do that, because the submitter has control of the end_io handler. So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on the underway I/O. There is a risk that pdflush::background_writeout() will lock up, repeatedly trying and failing to write the same page. This is prevented by ensuring that background_writeout() always throttles when it made no progress.
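In outline (the actual call site is inside __block_write_full_page()'s buffer walk; this helper is invented for the sketch):

    /* a locked-and-clean buffer means someone else's I/O is in flight;
     * redirty the page so a later fdatawrite() will wait on that I/O */
    static void note_busy_buffer(struct page *page)
    {
        __set_page_dirty_nobuffers(page);
    }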
Andrew Morton authored
Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY tag. Remove address_space.dirty_pages.
Andrew Morton authored
Arrange for under-writeback pages to be marked thus in their pagecache radix tree.
Andrew Morton authored
Arrange for all dirty pagecache pages to be tagged as dirty within their radix tree.
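The tagging itself, as a sketch (the tree_lock naming follows the conversion described in the intro patch below):

    /* mirror the page's dirtiness into the radix tree's dirty tag */
    static void tag_page_dirty(struct address_space *mapping,
                    struct page *page)
    {
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_tag_set(&mapping->page_tree, page->index,
                    PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);
    }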
Andrew Morton authored
Intro to these patches:
- Major surgery against the pagecache, radix-tree and writeback code. This work is to address the O_DIRECT-vs-buffered data exposure horrors which we've been struggling with for months. As a side-effect, 32 bytes are saved from struct inode and eight bytes are removed from struct page. At a cost of approximately 2.5 bits per page in the radix tree nodes on 4k pagesize, assuming the pagecache is densely populated. Not all pages are pagecache; other pages gain the full 8 byte saving. This change will break any arch code which is using page->list and will also break any arch code which is using page->lru of memory which was obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling with is in getting a really reliable fsync() across the page lists while other processes are performing writeback against the same file. It's like juggling four bars of wet soap with your eyes shut while someone is whacking you with a baseball bat. Daniel pretty much has the problem plugged but I suspect that's just because we don't have testcases to trigger the remaining problems. The complexity and additional locking which those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and replace the list-based writeback and wait operations with in-order radix-tree walks. The radix-tree code has been enhanced to support "tagging" of pages, for later searches for pages which have a particular tag set. This means that we can ask the radix tree code "find me the next 16 dirty pages starting at pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly. It is no longer the case that the kernel will submit pages for I/O in the order in which the application dirtied them. We instead submit them in file-offset order all the time. This is likely to be advantageous when applications are seeking all over a large file randomly writing small amounts of data. I haven't performed much benchmarking, but tiobench random write throughput seems to be increased by 30%. Other tests appear to be unaltered. dbench may have got 10-20% quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing small amounts of data: the blockdev mapping which caches filesystem metadata. The kernel's IO submission patterns for this are now ideal.

Because writeback and wait-for-writeback use a tree walk instead of a list walk they are no longer livelockable. This probably means that we no longer need to hold i_sem across O_SYNC writes and perhaps fsync() and fdatasync(). This may be beneficial for databases: multiple processes writing and syncing different parts of the same file at the same time can now all submit and wait upon writes to just their own little bit of the file, so we can get a lot more data into the queues.

It is trivial to implement a part-file-fdatasync() as well, so applications can say "sync the file from byte N to byte M", and multiple applications can do this concurrently. This is easy for ext2 filesystems, but probably needs lots of work for data-journalled filesystems and XFS, and it probably doesn't offer much benefit over an i_sem-less O_SYNC write.

These patches can end up making ext3 (even) slower:

    for i in 1 2 3 4
    do
        dd if=/dev/zero of=$i bs=1M count=2000 &
    done

runs awfully slow on SMP. This is, yet again, because all the file blocks are jumbled up and the per-file linear writeout causes tons of seeking. The above test runs sweetly on UP because on UP we don't allocate blocks to different files in parallel. Mingming and Badari are working on getting block reservation working for ext3 (preallocation on steroids). That should fix ext3 up.

This patch:
- Later, we'll need to access the radix trees from inside disk I/O completion handlers. So make mapping->page_lock irq-safe. And rename it to tree_lock to reliably break any missed conversions.
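The "next 16 dirty pages" query described above, sketched against the 2.6-era radix-tree API:

    /* find up to 16 dirty pages starting at pagecache index n */
    static unsigned find_dirty_pages(struct address_space *mapping,
                     unsigned long n, struct page **pages)
    {
        unsigned found;

        spin_lock_irq(&mapping->tree_lock);
        found = radix_tree_gang_lookup_tag(&mapping->page_tree,
                    (void **)pages, n, 16,
                    PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);
        return found;
    }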
Andrew Morton authored
Fix a race which was identified by Daniel McNeil <daniel@osdl.org> If a buffer_head is under I/O due to JBD's ordered data writeout (which uses ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() need to wait on the buffer's existing I/O. Presently neither will do so, because __block_write_full_page() will not actually submit any I/O and will hence not mark the page as being under writeback. The best-performing fix would be to somehow mark the page as being under writeback and defer waiting for the ll_rw_block-initiated I/O until filemap_fdatawait()-time. But this is hard, because in __block_write_full_page() we do not have control of the buffer_head's end_io handler. Possibly we could make JBD call into end_buffer_async_write(), but that gets nasty. This patch makes __block_write_full_page() wait for any buffer_head I/O to complete before inspecting the buffer_head state. It only does this in the case where __block_write_full_page() was called for a "data-integrity" write: (wbc->sync_mode != WB_SYNC_NONE). Probably it doesn't matter, because kjournald is currently submitting (or has already submitted) all dirty buffers anyway.
- 19 Mar, 2004 1 commit
Rusty Russell authored
Various files keep per-cpu caches which need to be freed/moved when a CPU goes down. All under CONFIG_HOTPLUG_CPU ifdefs.
scsi.c: drain dead cpu's scsi_done_q onto this cpu.
buffer.c: brelse the bh_lrus queue for dead cpu.
timer.c: migrate timers from dead cpu, being careful of lock order vs __mod_timer.
radix_tree.c: free dead cpu's radix_tree_preloads.
page_alloc.c: empty dead cpu's nr_pagecache_local into nr_pagecache, and free pages on cpu's local cache.
slab.c: stop reap_timer for dead cpu, adjust each cache's free limit, and free each slab cache's per-cpu block.
swap.c: drain dead cpu's lru_add_pvecs into ours, and empty its committed_space counter into global counter.
dev.c: drain device queues from dead cpu into this one.
flow.c: drain dead cpu's flow cache.
- 06 Mar, 2004 3 commits
Andrew Morton authored
Dave Kleikamp <shaggy@austin.ibm.com> points out a race between nobh_prepare_write() and end_buffer_read_sync(). end_buffer_read_sync() calls unlock_buffer(), waking the nobh_prepare_write() thread, which immediately frees the buffer_head. end_buffer_read_sync() then calls put_bh(), which decrements b_count for the already-freed structure. The SLAB_DEBUG code detects the slab corruption. We fix this by giving nobh_prepare_write() a private buffer_head end_io handler which doesn't touch the buffer's contents after unlocking it.
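A sketch of such a private handler (the name is assumed):

    /* set all state before unlock_buffer(); never touch the bh after
     * unlocking, because the waiter may free it immediately */
    static void nobh_end_read(struct buffer_head *bh, int uptodate)
    {
        if (uptodate)
            set_buffer_uptodate(bh);
        unlock_buffer(bh);
        /* deliberately no put_bh() here */
    }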
Andrew Morton authored
From: Eric Sandeen <sandeen@sgi.com>

Several functions in buffer.c are using unsigned long where they should be using sector_t. Also, use pgoff_t in several places so it is easier to tell what is being used as a pagecache index, what is being used as a disk index and what is being used as an offset-into-page.
Andrew Morton authored
From: Gerd Knorr <kraxel@suse.de>

Current gccs error out if a function's declaration and definition disagree about the register passing convention. The patch adds a new `fastcall' declaration primitive, and uses it in all the FASTCALL functions which we could find. A number of inconsistencies were fixed up along the way.
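On i386 the primitive boils down to a regparm attribute; the function here is invented for illustration:

    #define fastcall __attribute__((regparm(3)))

    int fastcall do_calc(int x);    /* declaration */

    int fastcall do_calc(int x)     /* definition: conventions now agree */
    {
        return x * 2;
    }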
- 18 Feb, 2004 1 commit
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au>

Three more removed CPU notifiers, extracted from the hotplug CPU patch.
kernel/softirq.c: the tasklet cpu preparation callback is useless: the vectors are already initialized to NULL. Even with the hotplug CPU patches, they're of little or no use.
fs/buffer.c: once again, they are already initialized to zero.
mm/page_alloc.c: once again, already initialized to zero.
- 20 Jan, 2004 1 commit
Andrew Morton authored
Ratelimit a couple of potentially-stormy printk's in the writeback code.
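The usual idiom, for reference (the message text is invented):

    /* emit at most a burst of warnings, then go quiet for a while */
    static void warn_stormy(void)
    {
        if (printk_ratelimit())
            printk(KERN_WARNING "writeback: falling behind\n");
    }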
- 19 Jan, 2004 3 commits
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au>

Some places use cpu_online() where they should be using cpu_possible(), most commonly for tallying statistics. This makes no difference without hotplug CPU. Use the for_each_cpu() macro in those places, providing good examples (and making the external hotplug CPU patch smaller).
Andrew Morton authored
From: Rik van Riel <riel@surriel.com>

In 2.6.0 both __alloc_pages() and the corresponding wakeup_kswapd()s walk all zones in the zone list, possibly spanning multiple nodes in a low numa factor system like AMD64. Also, if lower_zone_protection is set in /proc, then it may be possible that kswapd never cleans out data in zones further down the zonelist and try_to_free_pages needs to do that. However, in 2.6.0 try_to_free_pages() only frees pages in the pgdat the first zone in the zonelist belongs to. This is probably the wrong behaviour, since both the page allocator and the kswapd wakeup free things from all zones on the zonelist.

The following patch makes try_to_free_pages() consistent with the allocator, by passing the zonelist as an argument and freeing pages from all zones in the list. I do not have any numa systems myself, so I have only tested it on my own little smp box. Testing on NUMA systems may be useful, though the patch really only should have an impact in those rare cases where kswapd can't keep up with allocations...

As a side effect, the patch shrinks the kernel by 2 lines and replaces some subtle magic by a simpler array walk.
Andrew Morton authored
From: viro@parcelfarce.linux.theplanet.co.uk

In a bunch of places we used file->f_dentry->d_inode->i_sem to protect fdatasync et al. Replaced with the correct file->f_mapping->host->i_sem - the object we are protecting is the address_space, so we want an exclusion that would work for redirected ->i_mapping. For normal files (not coda, not bdev) it's all the same, of course - there we have file->f_mapping->host == file->f_dentry->d_inode and the change above is an equivalent transformation.
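The substitution, sketched (lock/unlock pairing elided; i_sem is a semaphore in this era):

    /* before: file->f_dentry->d_inode->i_sem - wrong when ->i_mapping
     * is redirected; after: the address_space host's i_sem */
    static void lock_for_fdatasync(struct file *file)
    {
        down(&file->f_mapping->host->i_sem);
    }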
- 30 Dec, 2003 1 commit
Andrew Morton authored
From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Suppress a buffer_error() warning which occurs when a page which previously had an I/O error gets its buffers stripped.
- 29 Sep, 2003 1 commit
Arnaldo Carvalho de Melo authored
- 19 Aug, 2003 3 commits
Andrew Morton authored
From: Oliver Xymoron <oxymoron@waste.org> Currently, a writepage() which detects that it is writing outside i_size (due to concurrent truncate) will abandon the write, returning -EIO. The return value will bogusly cause an error to be recorded in the address_space. So convert all those writepage() instances to return zero in this case.
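The shape of the change in each ->writepage() instance, sketched (foo_get_block and the boundary math are simplified placeholders):

    static int foo_get_block(struct inode *inode, sector_t block,
                 struct buffer_head *bh, int create);   /* placeholder */

    static int foo_writepage(struct page *page, struct writeback_control *wbc)
    {
        struct inode *inode = page->mapping->host;
        unsigned long end_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;

        if (page->index > end_index) {    /* wholly outside i_size */
            unlock_page(page);
            return 0;                     /* was -EIO: the race is benign */
        }
        return block_write_full_page(page, foo_get_block, wbc);
    }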
Andrew Morton authored
From: Oliver Xymoron <oxymoron@waste.org> This patch just saves a few bytes in the inode by turning mapping->gfp_mask into an unsigned long mapping->flags. The mapping's gfp mask is placed in the 16 high bits of mapping->flags and two of the remaining 16 bits are used for tracking EIO and ENOSPC errors. This leaves 14 bits in the mapping for future use. They should be accessed with the atomic bitops.
Andrew Morton authored
From: Oliver Xymoron <oxymoron@waste.org>

These patches add the infrastructure for reporting asynchronous write errors on block devices to userspace. Errors which are detected due to pdflush or VM writeout are reported at the next fsync, fdatasync, or msync on the given file, and on close if the error occurs in time. We do this by propagating any errors into page->mapping->error when they are detected. In fsync(), msync(), fdatasync() and close() we return that error and zero it out.

The Open Group say close() _may_ fail if an I/O error occurred while reading from or writing to the file system. Well, in this implementation close() can return -EIO or -ENOSPC. And in that case it will succeed, not fail - perhaps that is what they meant.

There are three patches in this series and testing has only been performed with all three applied.
- 06 Aug, 2003 1 commit
Andrew Morton authored
The problem with PF_READAHEAD is that if someone does a non-GFP_ATOMIC memory allocation we can enter page reclaim and then call writepage, while PF_READAHEAD is set. The block layer then drops writes or the wrong reads on the floor. It can cause data loss. A fix is complex (well, intrusive). Given that the readahead code is now skipping the entire readahead attempt if the queue is congested, the setting of PF_READAHEAD probably is not doing anything useful anyway, so simply remove it.