- 31 Aug, 2003 1 commit
Andrew Morton authored
Off-by-one in balance_pgdat(): `priority' can never go negative. It causes the scanning priority thresholds to be quite wrong, and kswapd tends to go berserk when there is a lot of mapped memory around.
-
- 21 Aug, 2003 1 commit
Patrick Mochel authored
Calls were moved to the PM core, so they must be compiled in to use them.
-
- 20 Aug, 2003 1 commit
Andrew Morton authored
In a further attempt to prevent dirty pages from being written out from the LRU, don't write them if they were referenced. This gives those pages another trip around the inactive list, so more of them are written via balance_dirty_pages(). It speeds up an untar of five kernel trees by 5% on a 256M box, presumably because balance_dirty_pages() has better IO patterns. It largely fixes the problem which Gerrit talked about at the kernel summit: the individual writepage()s of dirty pages coming off the tail of the LRU are reduced by 83% in their database workload. I'm a bit worried that it increases scanning and OOM possibilities under nutty VM stress cases, but nothing untoward has been noted during its four weeks in -mm, so...
-
- 19 Aug, 2003 2 commits
Andrew Morton authored
From: Oliver Xymoron <oxymoron@waste.org> This patch just saves a few bytes in the inode by turning mapping->gfp_mask into an unsigned long mapping->flags. The mapping's gfp mask is placed in the 16 high bits of mapping->flags and two of the remaining 16 bits are used for tracking EIO and ENOSPC errors. This leaves 14 bits in the mapping for future use. They should be accessed with the atomic bitops.
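A minimal user-space sketch of the packing scheme described above: the gfp mask kept in the upper 16 bits of a flags word, with a couple of low bits left over for error tracking. Bit positions and helper names here are illustrative, not the kernel's actual definitions.

    #include <stdio.h>

    #define AS_EIO        0                      /* async EIO seen (assumed bit position)    */
    #define AS_ENOSPC     1                      /* async ENOSPC seen (assumed bit position) */
    #define AS_GFP_SHIFT  16                     /* gfp mask lives above the flag bits       */

    static unsigned long mapping_set_gfp(unsigned long flags, unsigned int gfp)
    {
            flags &= (1UL << AS_GFP_SHIFT) - 1;  /* keep the low flag bits */
            return flags | ((unsigned long)gfp << AS_GFP_SHIFT);
    }

    static unsigned int mapping_gfp(unsigned long flags)
    {
            return (unsigned int)(flags >> AS_GFP_SHIFT);
    }

    int main(void)
    {
            unsigned long flags = 0;

            flags = mapping_set_gfp(flags, 0xd0);   /* some gfp mask value       */
            flags |= 1UL << AS_EIO;                 /* record an async I/O error */

            printf("gfp=%#x eio=%lu enospc=%lu\n", mapping_gfp(flags),
                   (flags >> AS_EIO) & 1, (flags >> AS_ENOSPC) & 1);
            return 0;
    }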
-
Andrew Morton authored
From: Oliver Xymoron <oxymoron@waste.org> These patches add the infrastructure for reporting asynchronous write errors to block devices back to userspace. Errors which are detected during pdflush or VM writeout are reported at the next fsync, fdatasync, or msync on the given file, and on close if the error occurs in time. We do this by propagating any errors into page->mapping->error when they are detected. In fsync(), msync(), fdatasync() and close() we return that error and zero it out. The Open Group say close() _may_ fail if an I/O error occurred while reading from or writing to the file system. Well, in this implementation close() can return -EIO or -ENOSPC. And in that case it will succeed, not fail - perhaps that is what they meant. There are three patches in this series and testing has only been performed with all three applied.
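A small sketch of the report-and-clear behaviour described above, written as plain user-space C with made-up names; it only illustrates the idea that the first writeout error is remembered and then reported exactly once at the next fsync/close.

    #include <stdio.h>
    #include <errno.h>

    static int mapping_error;              /* stands in for page->mapping->error */

    static void writeout_failed(int err)   /* called from pdflush/VM writeout    */
    {
            if (!mapping_error)
                    mapping_error = err;   /* remember the first error           */
    }

    static int file_fsync(void)            /* fsync/fdatasync/msync/close path   */
    {
            int err = mapping_error;
            mapping_error = 0;             /* report once, then clear            */
            return err;
    }

    int main(void)
    {
            writeout_failed(-EIO);
            printf("fsync -> %d, next fsync -> %d\n", file_fsync(), file_fsync());
            return 0;
    }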
-
- 18 Aug, 2003 1 commit
Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com>

Contributions from: Jan Dittmer <jdittmer@sfhq.hn.org>, Arnd Bergmann <arnd@arndb.de>, "Bryan O'Sullivan" <bos@serpentine.com>, "David S. Miller" <davem@redhat.com>, Badari Pulavarty <pbadari@us.ibm.com>, "Martin J. Bligh" <mbligh@aracnet.com>, Zwane Mwaikambo <zwane@linuxpower.ca>

It has been tested on x86, sparc64, x86_64, ia64 (I think), ppc and ppc64.

cpumask_t enables systems with NR_CPUS > BITS_PER_LONG to utilize all their cpus by creating an abstract data type dedicated to representing cpu bitmasks, similar to fd sets from userspace, and sweeping the appropriate code to update callers to the access API. The fd set-like structure is according to Linus' own suggestion; the macro calling convention to ambiguate representations with minimal code impact is my own invention.

Specifically, a new set of inline functions for manipulating arbitrary-width bitmaps is introduced with a relatively simple implementation, in tandem with a new data type representing bitmaps of width NR_CPUS, cpumask_t, whose accessor functions are defined in terms of the bitmap manipulation inlines. This bitmap ADT found an additional use in i386 arch code handling sparse physical APIC IDs, which was convenient to use in this case as the accounting structure was required to be wider to accommodate the physids consumed by larger numbers of cpus.

For the sake of simplicity and low code impact, these cpu bitmasks are passed primarily by value; however, an additional set of accessors along with an auxiliary data type with const call-by-reference semantics is provided to address performance concerns raised in connection with very large systems, such as SGI's larger models, where copying and call-by-value overhead would be prohibitive. Few (if any) users of the call-by-reference API are immediately introduced.

Also, in order to avoid calling convention overhead on architectures where structures are required to be passed by value, NR_CPUS <= BITS_PER_LONG is special-cased so that cpumask_t falls back to an unsigned long and the accessors perform the usual bit twiddling on unsigned longs as opposed to arrays thereof.

Audits were done with the structure overhead in-place, restoring this special-casing only afterward so as to ensure a more complete API conversion while undergoing the majority of its end-user exposure in -mm. More -mm's were shipped after its restoration to be sure that was tested, too.

The immediate users of this functionality are Sun sparc64 systems, SGI mips64 and ia64 systems, and IBM ia32, ppc64, and s390 systems. Of these, only the ppc64 machines needing the functionality have yet to be released; all others have had systems requiring it for full functionality for at least 6 months, and in some cases, since the initial Linux port to the affected architecture.
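A toy sketch of the cpumask idea: a bitmap NR_CPUS bits wide with small accessor functions. The names loosely mirror the kernel API, but this is an illustration only and omits the single-long special case and the call-by-reference variants.

    #include <stdio.h>
    #include <string.h>
    #include <limits.h>

    #define NR_CPUS        128
    #define BITS_PER_LONG  (CHAR_BIT * sizeof(long))
    #define CPUMASK_LONGS  ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

    typedef struct { unsigned long bits[CPUMASK_LONGS]; } cpumask_t;

    static void cpu_set(int cpu, cpumask_t *m)
    {
            m->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
    }

    static int cpu_isset(int cpu, const cpumask_t *m)
    {
            return (m->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1;
    }

    int main(void)
    {
            cpumask_t mask;

            memset(&mask, 0, sizeof(mask));
            cpu_set(0, &mask);
            cpu_set(100, &mask);        /* works even though 100 > BITS_PER_LONG on 32-bit */
            printf("cpu100 set? %d, cpu5 set? %d\n",
                   cpu_isset(100, &mask), cpu_isset(5, &mask));
            return 0;
    }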
-
- 01 Aug, 2003 4 commits
Andrew Morton authored
From: Nikita Danilov <Nikita@Namesys.COM> Use zone->pressure (rather than scanning priority) to determine when to start reclaiming mapped pages in refill_inactive_zone(). When using priority, every call to try_to_free_pages() starts with scanning parts of the active list and skipping mapped pages (because reclaim_mapped evaluates to 0 at low priorities) no matter how high memory pressure is.
-
Andrew Morton authored
From: Nikita Danilov <Nikita@Namesys.COM> The vmscan logic at present will scan the inactive list with increasing priority until a threshold is triggered. At that threshold we start unmapping pages from pagetables. The problem is that each time someone calls into this code, the priority is initially low, so some mapped pages will be refiled even though we really should be unmapping them now. Nikita's patch adds the `pressure' field to struct zone. It is a decaying average of the zone's memory pressure, and allows us to start unmapping pages immediately on entry to page reclaim, based on measurements which were made in earlier reclaim attempts.
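The exact averaging used by the patch isn't spelled out above, but the general shape of a decaying memory-pressure average can be sketched like this; the weights, scale and priority range below are assumptions for illustration only.

    #include <stdio.h>

    static unsigned int pressure;   /* stands in for zone->pressure */

    static void update_pressure(unsigned int priority)
    {
            /* Higher pressure for lower (more urgent) scanning priority;
             * blend the new sample into the running average. */
            unsigned int sample = (12 - priority) * 1024;   /* 12 == assumed max priority */

            pressure = (pressure * 3 + sample) / 4;         /* assumed 3:1 decay weights  */
    }

    int main(void)
    {
            for (int p = 12; p >= 6; p--) {
                    update_pressure(p);
                    printf("priority=%2d pressure=%u\n", p, pressure);
            }
            return 0;
    }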
-
Andrew Morton authored
kswapd currently takes a throttling nap even if it freed all the pages it was asked to free. Change it so we only throttle if reclaim is not being sufficiently successful.
-
Andrew Morton authored
We need to subtract the number of freed slab pages from the number of pages to free, not add it.
-
- 12 Jul, 2003 1 commit
Bernardo Innocenti authored
- __div64_32(): remove __attribute_pure__ qualifier from the prototype since this function obviously clobbers memory through &(n);
- do_div(): add a check to ensure (n) is type-compatible with uint64_t;
- as_update_iohist(): use sector_div() instead of do_div(). (Whether the result of the addition should always be stored in 64 bits regardless of CONFIG_LBD is still being discussed, therefore it's unaddressed here);
- Fix all places where do_div() was being called with a bad divisor argument.
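A simplified sketch of the do_div()-style type check mentioned above: comparing a pointer to (n) against a uint64_t pointer makes the compiler warn whenever (n) is not 64 bits wide. This is a user-space approximation using GCC statement expressions, not the real asm-generic macro.

    #include <stdio.h>
    #include <stdint.h>

    #define do_div_sketch(n, base) ({                              \
            uint32_t __base = (base);                              \
            uint32_t __rem;                                        \
            /* warns if (n) is not uint64_t-compatible */          \
            (void)(((typeof((n)) *)0) == ((uint64_t *)0));         \
            __rem = (uint32_t)((n) % __base);                      \
            (n) /= __base;                                         \
            __rem;                                                 \
    })

    int main(void)
    {
            uint64_t sectors = 10000000001ULL;
            uint32_t rem = do_div_sketch(sectors, 1000);   /* divides in place */

            printf("quotient=%llu remainder=%u\n",
                   (unsigned long long)sectors, rem);
            return 0;
    }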
-
- 14 Jun, 2003 1 commit
Andrew Morton authored
From: Anton Blanchard <anton@samba.org>

Anton has been testing odd setups:

  /* node 0 - no cpus, no memory */
  /* node 1 - 1 cpu, no memory */
  /* node 2 - 0 cpus, 1GB memory */
  /* node 3 - 3 cpus, 3GB memory */

Two things tripped so far. Firstly the ppc64 debug check for invalid cpus in cpu_to_node(). Fix that in kernel/sched.c:node_nr_running_init(). The other problem concerned nodes with memory but no cpus. kswapd tries to set_cpus_allowed(0) and bad things happen. So we only set cpu affinity for kswapd if there are cpus in the node.
-
- 06 Jun, 2003 1 commit
Andrew Morton authored
From: Matthew Dobson <colpatch@us.ibm.com> sched_best_cpu schedules processes on nodes based on node_nr_running. For CPU-less nodes, this is always 0, and thus sched_best_cpu tends to migrate tasks to these nodes, which eventually get remigrated elsewhere. This patch adds include/linux/topology.h, and modifies all includes of asm/topology.h to linux/topology.h. A subsequent patch in this series adds helper functions to linux/topology.h to ensure processes are only migrated to nodes with CPUs. Test compiled and booted by Andrew Theurer (habanero@us.ibm.com) on both x440 and ppc64.
-
- 22 May, 2003 1 commit
Andrew Morton authored
The calling task must have a valid reclaim_state when running page reclaim. But I had forgotten about shrink_all_memory().
-
- 07 May, 2003 1 commit
Andrew Morton authored
try_to_free_pages() currently fails to notice that it successfully freed slab pages via shrink_slab(). So it can keep looping and eventually call out_of_memory(), even though there's a lot of memory now free. And even if it doesn't do that, it can free too much memory. The patch changes try_to_free_pages() so that it will notice freed slab pages and will return when enough memory has been freed via shrink_slab(). Many options were considered, but most of them were unacceptably inaccurate, intrusive or sleazy. I ended up putting the accounting into a stack-local structure which is pointed to by current->reclaim_state. One reason for this is that we can cleanly resurrect the current->local_pages pool by putting it into struct reclaim_state. (current->local_pages was removed because the per-cpu page pools in the page allocator largely duplicate its function. But it is still possible for interrupt-time allocations to steal just-freed pages, so we might want to put it back some time.)
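A user-space sketch of the accounting pattern described above: the caller points a per-task field at a stack-local structure, and the slab-shrinking path bumps a counter through it. Names are modelled on the description, not the exact kernel code.

    #include <stdio.h>
    #include <stddef.h>

    struct reclaim_state { unsigned long reclaimed_slab; };

    /* stands in for current->reclaim_state */
    static struct reclaim_state *current_reclaim_state;

    static void shrink_slab(unsigned long nr_freed)
    {
            if (current_reclaim_state)
                    current_reclaim_state->reclaimed_slab += nr_freed;
    }

    static unsigned long try_to_free_pages(void)
    {
            struct reclaim_state rs = { 0 };

            current_reclaim_state = &rs;
            shrink_slab(32);                  /* freed slab pages are now accounted */
            current_reclaim_state = NULL;

            return rs.reclaimed_slab;         /* caller can stop early if this is enough */
    }

    int main(void)
    {
            printf("freed %lu pages\n", try_to_free_pages());
            return 0;
    }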
-
- 30 Apr, 2003 1 commit
Andrew Morton authored
Fix a bug identified by Nikita Danilov: refill_inactive_zone() is deferring the update of zone->nr_inactive and zone->nr_active for too long - they need to be consistent whenever zone->lock is not held.
-
- 20 Apr, 2003 3 commits
Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com> If one's goal is to free highmem pages, shrink_slab() is an ineffective method of recovering them, as slab pages are all ZONE_NORMAL or ZONE_DMA. Hence, this "FIXME: do not do for zone highmem". Presumably this is a question of policy, as highmem allocations may be satisfied by reaping slab pages and handing them back; but the FIXME says what we should do.
-
Andrew Morton authored
This is a cleanup patch. There are quite a lot of places in the kernel which will infinitely retry a memory allocation. Generally, they get it wrong. Some do yield(), the semantics of which have changed over time. Some do schedule(), which can lock up if the caller is SCHED_FIFO/RR. Some do schedule_timeout(), etc. And often it is unnecessary, because the page allocator will do the retry internally anyway. But we cannot rely on that - this behaviour may change (-aa and -rmap kernels do not do this, for instance). So it is good to formalise and to centralise this operation.

If an allocation specifies __GFP_REPEAT then the page allocator must infinitely retry the allocation. The semantics of __GFP_REPEAT are "try harder". The allocation _may_ fail (the 2.4 -aa and -rmap VMs do not retry infinitely by default).

The semantics of __GFP_NOFAIL are "cannot fail". It is a no-op in this VM, but needs to be honoured (or fix up the callers) if the VM is changed to not retry infinitely by default.

The semantics of __GFP_NOREPEAT are "try once, don't loop". This isn't used at present (although perhaps it should be, in swapoff). It is mainly for completeness.
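A toy sketch of how an allocator loop might honour the three policies described above. The flag names and behaviour here are a user-space illustration, not the 2.5 page allocator.

    #include <stdio.h>
    #include <stdlib.h>

    #define GFP_REPEAT   0x1   /* "try harder": retry, but may still fail someday */
    #define GFP_NOFAIL   0x2   /* "cannot fail": retry until success              */
    #define GFP_NOREPEAT 0x4   /* "try once, don't loop"                          */

    static void *try_alloc(size_t size)          /* one allocation attempt */
    {
            return malloc(size);
    }

    static void *alloc_with_policy(size_t size, unsigned flags)
    {
            void *p = try_alloc(size);

            /* plain or GFP_NOREPEAT allocations return after one attempt */
            while (!p && (flags & (GFP_REPEAT | GFP_NOFAIL))) {
                    /* in this sketch both flags retry forever; a stricter VM
                     * could eventually give up on GFP_REPEAT but never on
                     * GFP_NOFAIL */
                    p = try_alloc(size);
            }
            return p;
    }

    int main(void)
    {
            void *p = alloc_with_policy(64, GFP_NOFAIL);

            printf("got %p\n", p);
            free(p);
            return 0;
    }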
-
Andrew Morton authored
From: William Lee Irwin III <wli@holomorphy.com> Remove page_has_buffers() from various functions, document the dependencies on buffer_head.h from other files besides filemap.c, and s/this file/core VM/ in filemap.c
-
- 09 Apr, 2003 1 commit
Andrew Morton authored
Spinlocks don't have a buslocked unlock and are faster. On a P4, time to write a 4M file with 4M one-byte-write()s:

Before:
  0.72s user 5.47s system 99% cpu 6.227 total
  0.76s user 5.40s system 100% cpu 6.154 total
  0.77s user 5.38s system 100% cpu 6.146 total

After:
  1.09s user 4.92s system 99% cpu 6.014 total
  0.74s user 5.28s system 99% cpu 6.023 total
  1.03s user 4.97s system 100% cpu 5.991 total
-
- 28 Mar, 2003 2 commits
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Raised the #endif CONFIG_SWAP in shrink_list; it was excluding try_to_unmap of file pages. Suspect !CONFIG_MMU relied on that to suppress try_to_unmap, so added a SWAP_FAIL stub for it.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Delete unused SWAP_ERROR and non-existent page_over_rsslimit().
-
- 15 Feb, 2003 1 commit
Andrew Morton authored
blk_congestion_wait() will currently not wait if there are no write requests in flight. Which is a potential problem if all the dirty data is against NFS filesystems. For write(2) traffic against NFS, things work nicely, because writers throttle in nfs_wait_on_requests(). But for MAP_SHARED dirtyings we need to avoid spinning in balance_dirty_pages(). So allow callers to fall through to the explicit sleep in that case. This will also fix a weird lockup which the reiser4 developers report. In that case they have managed to have _all_ inodes against a superblock in locked state, yet there are no write requests in flight. Taking a nap in blk_congestion_wait() in this case will yield the CPU to the threads which are trying to write out pages. Also tune up the sleep durations in various callers - 250 milliseconds seems rather long.
-
- 12 Feb, 2003 1 commit
Andrew Morton authored
Make the "duplicate const" warning go away. Arguably a compiler bug...
-
- 11 Feb, 2003 1 commit
Linus Torvalds authored
Add a name argument to daemonize() (va_arg) to avoid all the kernel threads having to duplicate the name setting over and over again. Make daemonize() disable all signals by default, and add an "allow_signal()" function to let daemons say they explicitly want to support a signal. Make flush_signals() take the signal lock, so that callers do not need to.
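A sketch of the kernel-thread boilerplate this enables (2.6-era style, for illustration only, not taken from the patch): the thread names itself via daemonize() and then explicitly opts back in to the one signal it cares about.

    #include <linux/sched.h>
    #include <linux/signal.h>

    static int my_daemon(void *unused)
    {
            daemonize("mydaemon");         /* set the name; all signals now blocked */
            allow_signal(SIGTERM);         /* explicitly opt back in to SIGTERM     */

            while (!signal_pending(current)) {
                    /* ... do the daemon's work ... */
            }
            return 0;
    }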
-
- 06 Feb, 2003 1 commit
Andrew Morton authored
We don't need these with self-unplugging queues. The patch also contains a couple of microopts suggested by Andrea: we don't need to run sync_page() if the page just came unlocked.
-
- 04 Feb, 2003 2 commits
Andrew Morton authored
Patch from Matthew Dobson <colpatch@us.ibm.com> When I originally wrote the patches implementing the in-kernel topology macros, they were meant to be called as a second layer of functions, sans underbars. This additional layer was deemed unnecessary and summarily dropped. As such, carrying around (and typing!) all these extra underbars is quite pointless. Here's a patch to nip this in the (sorta) bud. The macros only appear in 16 files so far, most of them being the definitions themselves.
-
Andrew Morton authored
Patch From: Hugh Dickins <hugh@veritas.com> Recently noticed that __GFP_HIGHIO has played no real part since bounce buffering was converted to mempool in 2.5.12: so this patch (over 2.5.58-mm1) removes it and GFP_NOHIGHIO and SLAB_NOHIGHIO. Also removes GFP_KSWAPD, in 2.5 same as GFP_KERNEL; leaves GFP_USER, which can be a useful comment, even though in 2.5 same as GFP_KERNEL. One anomaly needs comment: strictly, if there's no __GFP_HIGHIO, then GFP_NOHIGHIO translates to GFP_NOFS; but GFP_NOFS looks wrong in the block layer, and if you follow them down, you find that GFP_NOFS and GFP_NOIO behave the same way in mempool_alloc - so I've used the less surprising GFP_NOIO to replace GFP_NOHIGHIO.
-
- 21 Dec, 2002 2 commits
Andrew Morton authored
The `low latency page reclaim' design works by preventing page allocators from blocking on request queues (and by preventing them from blocking against writeback of individual pages, but that is immaterial here).

This has a problem under some situations. pdflush (or a write(2) caller) could be saturating the queue with highmem pages. This prevents anyone from writing back ZONE_NORMAL pages. We end up doing enormous amounts of scanning.

A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory, then kill the mmapping applications. The machine instantly goes from 0% of memory dirty to 95% or more. pdflush kicks in and starts writing the least-recently-dirtied pages, which are all highmem. The queue is congested so nobody will write back ZONE_NORMAL pages. kswapd chews 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim efficiency (pages_reclaimed/pages_scanned) falls to 2%.

So this patch changes the policy for kswapd. kswapd may use all of a request queue, and is prepared to block on request queues.

What will now happen in the above scenario is:

1: The page allocator scans some pages, fails to reclaim enough memory and takes a nap in blk_congestion_wait().

2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back pages. (These pages will be rotated to the tail of the inactive list at IO-completion interrupt time). This writeback will saturate the queue with ZONE_NORMAL pages. Conveniently, pdflush will avoid the congested queues. So we end up writing the correct pages.

In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim efficiency rises from 2% to 40% and things are generally a lot happier.

The downside is that kswapd may now do a lot less page reclaim, increasing page allocation latency, causing more direct reclaim, increasing lock contention in the VM, etc. But I have not been able to demonstrate that in testing.

The other problem is that there is only one kswapd, and there are lots of disks. That is a generic problem - without being able to co-opt user processes we don't have enough threads to keep lots of disks saturated. One fix for this would be to add an additional "really congested" threshold in the request queues, so kswapd can still perform nonblocking writeout. This gives kswapd priority over pdflush while allowing kswapd to feed many disk queues. I doubt if this will be called for.
-
Andrew Morton authored
There's a small window in which another CPU could dirty the page after we've cleaned it, and before we've moved it to mapping->dirty_pages. The end result is a dirty page on mapping->locked_pages, which is wrong. So take mapping->page_lock before clearing the dirty bit.
-
- 14 Dec, 2002 5 commits
Andrew Morton authored
This ad-hoc assertion is no longer true: if all zones are in the `all unreclaimable' state it can trigger when testing with a tiny amount of physical memory.
-
Andrew Morton authored
current->flags:PF_SYNC was a hack I added because I didn't want to change all ->writepage implementations. It's foul. And it means that if someone happens to run direct page reclaim within the context of (say) sys_sync, the writepage invocations from the VM will be treated as "data integrity" operations, not "memory cleansing" operations, which would cause latency. So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage. It is the `writeback_control' structure which contains the full context information about why writepage was called. The initial version of this patch just passed in a bare `int sync', but the XFS team need more info so they can perform writearound from within page reclaim. The patch also adds writeback_control.for_reclaim, so writepage implementations can inspect that to work out the call context rather than peeking at current->flags:PF_MEMALLOC.
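A compact sketch of the call-context structure described above. The field and constant names follow the description (sync_mode for integrity vs. cleansing writeback, for_reclaim for VM writeout), but treat this as an illustration rather than the exact 2.5 definition.

    #include <stdio.h>

    enum writeback_sync_modes { WB_SYNC_NONE, WB_SYNC_ALL };

    struct writeback_control {
            enum writeback_sync_modes sync_mode;  /* data integrity vs. memory cleansing */
            unsigned for_reclaim:1;               /* set when called from page reclaim   */
    };

    static int writepage(struct writeback_control *wbc)
    {
            if (wbc->for_reclaim)
                    return 0;   /* e.g. a filesystem could choose to do writearound here */
            return wbc->sync_mode == WB_SYNC_ALL ? 1 : 0;
    }

    int main(void)
    {
            struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, .for_reclaim = 1 };

            printf("writepage -> %d\n", writepage(&wbc));
            return 0;
    }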
-
Andrew Morton authored
- /proc/vmstat:pageoutrun and /proc/vmstat:allocstall are always identical. Rework this so that:
  - "allocstall" is the number of times a page allocator ran direct reclaim
  - "pageoutrun" is the number of times kswapd ran page reclaim
- Add a new stat: "pgrotated" - the number of pages which were rotated to the tail of the LRU for immediate reclaim by rotate_reclaimable_page().
- Document things a bit.
-
Andrew Morton authored
fail_writepage() does not work. Its activate_page() call cannot activate the page because it is not on the LRU. So perform that function (more efficiently) in the VM. Remove fail_writepage() and, if the filesystem does not implement ->writepage() then activate the page from shrink_list().

A special case is tmpfs, which does have a writepage, but which sometimes wants to activate the pages anyway. The most important case is when there is no swap online and we don't want to keep all those pages on the inactive list. So just as a tmpfs special-case, allow writepage() to return WRITEPAGE_ACTIVATE, and handle that in the VM.

Also, the whole idea of allowing ->writepage() to return -EAGAIN, and handling that in the caller, has been reverted. If a writepage() implementation wants to back out and not write the page, it must redirty the page, unlock it and return zero. (This is Hugh's preferred way).

And remove the now-unneeded shmem_writepages() - shmem inodes are marked as `memory backed' so it will not be called.

And remove the test for non-null ->writepage() in generic_file_mmap(). Memory-backed files _are_ mmappable, and they do not have a writepage(). It just isn't called.

So the locking rules for writepage() are unchanged. They are:

- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.

But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case.
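A toy sketch of the writepage() return protocol described above - the tmpfs-only WRITEPAGE_ACTIVATE case versus the normal write path. The constant value and function bodies are stand-ins, not the kernel's code.

    #include <stdio.h>

    #define WRITEPAGE_ACTIVATE 1   /* special return: keep the page, move it to the active list */

    static int tmpfs_writepage(int swap_available)
    {
            if (!swap_available)
                    return WRITEPAGE_ACTIVATE;   /* page stays locked; VM activates it */
            /* ... start swap I/O, unlock the page ... */
            return 0;
    }

    static void shrink_list(int swap_available)
    {
            int ret = tmpfs_writepage(swap_available);

            if (ret == WRITEPAGE_ACTIVATE)
                    printf("activate page (still locked)\n");
            else
                    printf("page written, can be reclaimed later\n");
    }

    int main(void)
    {
            shrink_list(0);   /* no swap online */
            shrink_list(1);   /* swap available */
            return 0;
    }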
-
Andrew Morton authored
The pte_chain_unlock() needs to be outside the ifdef.
-
- 03 Dec, 2002 4 commits
Andrew Morton authored
With some workloads a large number of pages coming off the LRU are pinned blockdev pagecache - things like ext2 group descriptors, pages which have buffers in the per-cpu buffer LRUs, etc. They keep churning around the inactive list, reducing the overall page reclaim effectiveness. So move these pages onto the active list.
-
Andrew Morton authored
Pages from memory-backed filesystems are supposed to be moved up onto the active list, but that's not working because fail_writepage() is called when the page is not on the LRU. So look for this case in page reclaim and handle it there. And it's more efficient: the VM knows more about what is going on, and it later leads to the removal of fail_writepage().
-
Andrew Morton authored
The patch addresses some search complexity failures which occur when there is a large amount of dirty data on the inactive list. Normally we attempt to write out those pages and then move them to the head of the inactive list. But this goes against page aging, and means that the page has to traverse the entire list again before it can be reclaimed. But the VM really wants to reclaim that page - it has reached the tail of the LRU. So what we do in this patch is to mark the page as needing reclamation, and then start I/O. In the IO completion handler we check to see if the page is still probably reclaimable and if so, move it to the tail of the inactive list, where it can be reclaimed immediately. Under really heavy swap-intensive loads this increases the page reclaim efficiency (pages reclaimed/pages scanned) from 10% to 25%. Which is OK for that sort of load. Not great, but OK. This code path takes the LRU lock once per page. I didn't bother playing games with batching up the locking work - it's a rare code path, and the machine has plenty of CPU to spare when this is happening.
-
Andrew Morton authored
This removes the last remnant of the 2.4 way of throttling page allocators: the wait_on_page_writeback() against mapped-or-swapcache pages. I did this because:

a) It's not used much.
b) It's already causing big latencies.
c) With Jens' large-queue stuff, it can cause huuuuuuuuge latencies. Like: ninety seconds.

So kill it, and rely on blk_congestion_wait() to slow the allocator down to match the rate at which the IO system can retire writes.
-
- 26 Nov, 2002 1 commit
Andrew Morton authored
Shrinking a huge number of dentries or inodes can hold dcache_lock or inode_lock for a long time. Not only does this hold off preemption - holding those locks basically shuts down the whole VFS. A neat fix for all such caches is to chunk the work up at the shrink_slab() level. I made the chunksize pretty small, for scalability reasons - avoid holding the lock for too long so another CPU can come in, acquire it and go off to do some work.
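A small sketch of the chunking idea: ask the shrinker for modest batches so whatever lock it takes internally is dropped and re-acquired between chunks. The batch size and shrinker signature here are assumptions for illustration.

    #include <stdio.h>

    #define SHRINK_BATCH 128   /* small chunk size, assumed value */

    static long cache_objects = 100000;

    static long shrinker(long nr_to_scan)   /* takes/drops the cache lock internally */
    {
            if (nr_to_scan > cache_objects)
                    nr_to_scan = cache_objects;
            cache_objects -= nr_to_scan;
            return cache_objects;           /* remaining objects */
    }

    static void shrink_slab(long nr_requested)
    {
            while (nr_requested > 0) {
                    long batch = nr_requested > SHRINK_BATCH ? SHRINK_BATCH : nr_requested;

                    shrinker(batch);        /* lock held only for one small batch      */
                    nr_requested -= batch;  /* another CPU can take the lock in between */
            }
    }

    int main(void)
    {
            shrink_slab(1000);
            printf("objects left: %ld\n", cache_objects);
            return 0;
    }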
-