- 16 Oct, 2007 40 commits
-
-
Nick Piggin authored
Revert the patch from Neil Brown to optimise NFSD writev handling. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hisashi Hifumi authored
While running a memory-intensive load, system response deteriorated just after swap-out started. The cause is that when a PG_reclaim page is moved to the tail of the inactive LRU list in rotate_reclaimable_page(), the lru_lock spin lock is acquired on every page writeback. This degrades system performance and lengthens interrupt hold-off time once swap-out starts.

The following patch solves this problem. It uses a pagevec when rotating reclaimable pages to mitigate LRU spin lock contention and reduce interrupt hold-off time.

I ran a test that allocates and touches pages in multiple processes while pinging the test machine in flood mode, to measure response under memory-intensive load. The test results are:

    -2.6.23-rc5
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
    rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms

    -2.6.23-rc5-patched
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

Max round-trip time was improved. The test machine has 4 CPUs (3.16GHz, Hyper-Threading enabled), 8GB of memory and 8GB of swap.

I ran the ping test again to observe any performance deterioration caused by taking a ref.

    -2.6.23-rc6-with-modifiedpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
    rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms

The result for my original patch is as follows.

    -2.6.23-rc5-with-originalpatch
    --- testmachine ping statistics ---
    3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
    rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms

The effect on response was small.
[akpm@linux-foundation.org: fix uninitialised var warning] [hugh@veritas.com: fix locking] [randy.dunlap@oracle.com: fix function declaration] [hugh@veritas.com: fix BUG at include/linux/mm.h:220!] [hugh@veritas.com: kill redundancy in rotate_reclaimable_page] [hugh@veritas.com: move_tail_pages into lru_add_drain] Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
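The effect of batching on lock traffic can be illustrated with a small user-space simulation. This is only a sketch: `BATCH_SIZE`, `struct lru_sim`, `struct page_batch` and the helper names are hypothetical and are not the kernel's pagevec API. A counter stands in for the lru_lock, and integers stand in for pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch capacity; the kernel's pagevec has its own fixed size. */
#define BATCH_SIZE 14

struct lru_sim {
    unsigned long lock_acquisitions; /* how often the simulated lru_lock was taken */
    unsigned long pages_rotated;     /* pages moved to the tail of the list */
};

struct page_batch {
    size_t nr;
    int pages[BATCH_SIZE]; /* page ids stand in for struct page pointers */
};

/* Unbatched path: one lock round trip per page. */
static void rotate_one(struct lru_sim *lru, int page)
{
    (void)page;
    lru->lock_acquisitions++;
    lru->pages_rotated++;
}

/* Drain the batch: one lock round trip covers every queued page. */
static void batch_drain(struct lru_sim *lru, struct page_batch *b)
{
    if (b->nr == 0)
        return;
    lru->lock_acquisitions++;
    lru->pages_rotated += b->nr;
    b->nr = 0;
}

/* Batched path: queue the page and only take the lock when the batch is full. */
static void rotate_batched(struct lru_sim *lru, struct page_batch *b, int page)
{
    b->pages[b->nr++] = page;
    if (b->nr == BATCH_SIZE)
        batch_drain(lru, b);
}

/* Rotate n pages and report how many times the lock was taken. */
static unsigned long simulate(unsigned long n, int batched)
{
    struct lru_sim lru = { 0, 0 };
    struct page_batch b = { 0, { 0 } };
    unsigned long i;

    for (i = 0; i < n; i++) {
        if (batched)
            rotate_batched(&lru, &b, (int)i);
        else
            rotate_one(&lru, (int)i);
    }
    batch_drain(&lru, &b); /* flush the partial batch left at the end */
    assert(lru.pages_rotated == n);
    return lru.lock_acquisitions;
}
```

With these assumed numbers, rotating 3000 pages takes the simulated lock 3000 times unbatched and 215 times batched. The real gain depends on the kernel's batch size and on how often partial batches are drained.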
-
Lee Schermerhorn authored
Allow an application to query the memories allowed by its context. Updated numa_memory_policy.txt to mention that applications can use this to obtain allowed memories for constructing valid policies. TODO: update out-of-tree libnuma wrapper[s], or maybe add a new wrapper--e.g., numa_get_mems_allowed() ? Also, update numa syscall man pages. Tested with memtoy V>=0.13. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Rik van Riel authored
The current VM can get itself into trouble fairly easily on systems with a small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory. On one side, page_alloc() will allocate down to zone->pages_low, while on the other side, kswapd() and balance_pgdat() will try to free memory from every zone, until every zone has more free pages than zone->pages_high. Highmem can be filled up to zone->pages_low with page tables, ramfs, vmalloc allocations and other unswappable things quite easily and without many bad side effects, since we still have a huge ZONE_NORMAL to do future allocations from. However, as long as the number of free pages in the highmem zone is below zone->pages_high, kswapd will continue swapping things out from ZONE_NORMAL, too! Sami Farin managed to get his system into a state where kswapd had freed about 700MB of low memory and was still "going strong". The attached patch will make kswapd stop paging out data from zones when there is more than enough memory free. We do go above zone->pages_high in order to keep pressure between zones equal in normal circumstances, but the patch should prevent the kind of excesses that made Sami's computer totally unusable. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jesper Juhl authored
vmalloc() returns a void pointer, so there's no need to cast its return value in mm/page_alloc.c::zone_wait_table_init(). Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jeff Moyer authored
A while back, Nick Piggin introduced a patch to reduce the node memory usage for small files (commit cfd9b7df):

    -#define RADIX_TREE_MAP_SHIFT 6
    +#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)

Unfortunately, he didn't take into account the fact that the calculation of the maximum path was based on an assumption of having to round up:

    #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2)

So, if CONFIG_BASE_SMALL is set, you will end up with a RADIX_TREE_MAX_PATH that is one greater than necessary. The practical upshot of this is just a bit of wasted memory (one long in the height_to_maxindex array, an extra pre-allocated radix tree node per cpu, and extra stack usage in a couple of functions), but it seems worth getting right.

It's also worth noting that I never build with CONFIG_BASE_SMALL. What I did to test this was duplicate the code in a small user-space program and check the results of the calculations for max path and the contents of the height_to_maxindex array.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
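The arithmetic can be checked with a user-space sketch in the spirit of the test program mentioned above. The function names are hypothetical, and the "needed" figure rests on an assumption that is not quoted from the patch: the tree needs ceil(index_bits / map_shift) levels, and a table indexed by height needs one extra entry for height 0.

```c
#include <assert.h>

/* Formula quoted in the commit message: truncating division plus 2. */
static unsigned int old_max_path(unsigned int index_bits, unsigned int map_shift)
{
    assert(map_shift > 0);
    return index_bits / map_shift + 2;
}

/* Levels needed to cover index_bits: ceil(index_bits / map_shift). */
static unsigned int needed_height(unsigned int index_bits, unsigned int map_shift)
{
    assert(map_shift > 0);
    return (index_bits + map_shift - 1) / map_shift;
}

/* Assumed table size: one entry per height from 0 to needed_height inclusive. */
static unsigned int needed_entries(unsigned int index_bits, unsigned int map_shift)
{
    return needed_height(index_bits, map_shift) + 1;
}

/* How many entries the old formula allocates beyond what is needed. */
static unsigned int wasted_entries(unsigned int index_bits, unsigned int map_shift)
{
    return old_max_path(index_bits, map_shift) - needed_entries(index_bits, map_shift);
}
```

Under these assumptions the old formula is exact when the shift does not divide the index width (shift 6 with 32 or 64 bits) and one too large when it does (shift 4), which matches the "one greater than necessary" observation for CONFIG_BASE_SMALL.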
-
Nick Piggin authored
nobh mode error handling is not just pretty slack, it's wrong. One cannot zero out the whole page to ensure new blocks are zeroed, because it just brings the whole page "uptodate" with zeroes even if that may not be the correct uptodate data. Also, other parts of the page may already contain dirty data which would get lost by zeroing it out. Thirdly, the writeback of zeroes to the new blocks will also erase existing blocks. All these conditions are pagecache and/or filesystem corruption. The problem comes about because we didn't keep track of which buffers actually are new or old. However it is not enough just to keep only this state, because at the point we start dirtying parts of the page (new blocks, with zeroes), the handling of IO errors becomes impossible without buffers because the page may only be partially uptodate, in which case the page flags alone cannot capture the state of the parts of the page. So allocate all buffers for the page upfront, but leave them unattached so that they don't pick up any other references and can be freed when we're done. If the error path is hit, then zero the new buffers as the regular buffer path does, then attach the buffers to the page so that it can actually be written out correctly and be subject to the normal IO error handling paths. As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page systems. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dmitry Monakhov authored
Move duplicated code from end_buffer_read_XXX methods to separate helper function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
A NULL pointer means that the object was not allocated. One cannot determine the size of an object that has not been allocated. Currently we return 0 but we really should BUG() on attempts to determine the size of something nonexistent. krealloc() interprets NULL to mean a zero sized object. Handle that separately in krealloc(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
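A user-space analogue may make the split of responsibilities clearer. Everything here is a hypothetical stand-in: `sketch_alloc`, `sketch_ksize` and `sketch_krealloc` are not kernel functions, `abort()` stands in for BUG(), and a size header in front of each object replaces the slab allocator's bookkeeping. The sketch needs C11 for `max_align_t` and does not model the kernel's handling of a zero new size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Header stored in front of every object; the union keeps the payload aligned. */
union sketch_hdr {
    size_t size;
    max_align_t align;
};

static void *sketch_alloc(size_t size)
{
    union sketch_hdr *h = malloc(sizeof(*h) + size);

    if (!h)
        return NULL;
    h->size = size;
    return h + 1;
}

static void sketch_free(void *p)
{
    if (p)
        free((union sketch_hdr *)p - 1);
}

/* Size query: asking for the size of a NULL object is a caller bug. */
static size_t sketch_ksize(const void *p)
{
    if (!p)
        abort(); /* stands in for BUG() */
    return ((const union sketch_hdr *)p - 1)->size;
}

/* Resize: NULL is treated as a zero-sized object here, not in sketch_ksize(). */
static void *sketch_krealloc(void *p, size_t new_size)
{
    size_t old_size = p ? sketch_ksize(p) : 0;
    void *n;

    if (p && new_size <= old_size)
        return p; /* existing object is already big enough */

    n = sketch_alloc(new_size);
    if (!n)
        return NULL;
    if (p) {
        memcpy(n, p, old_size);
        sketch_free(p);
    }
    return n;
}
```

The point of the design is that only the resize path knows NULL can legitimately mean "nothing allocated yet"; the size query no longer hides that case by returning 0.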
-
Dean Nelson authored
The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not PAGE_CACHE_SIZE. At the moment linux/pagemap.h has PAGE_CACHE_SHIFT defined as PAGE_SHIFT, but should that ever change this calculation would break. Signed-off-by: Dean Nelson <dcn@sgi.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
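As an illustration, the calculation can be reproduced in user space. The `SKETCH_` constants and `struct sketch_vma` are stand-ins chosen for this example (a 4KB page is assumed); only the shape of the expression follows the description above.

```c
#include <assert.h>

/* Assumed 4KB pages for the example. */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Minimal stand-in for the two vm_area_struct fields the calculation uses. */
struct sketch_vma {
    unsigned long vm_start; /* first address covered by the mapping */
    unsigned long vm_pgoff; /* file offset of vm_start, in PAGE_SIZE units */
};

/*
 * File page offset of a faulting address in a linear mapping.
 * vm_pgoff counts PAGE_SIZE units, so the shift must be PAGE_SHIFT.
 */
static unsigned long linear_pgoff(const struct sketch_vma *vma, unsigned long address)
{
    assert(address >= vma->vm_start);
    return (((address & SKETCH_PAGE_MASK) - vma->vm_start) >> SKETCH_PAGE_SHIFT)
           + vma->vm_pgoff;
}
```

For a mapping that starts at 0x400000 with vm_pgoff 10, an access at 0x402abc lands in the third page of the mapping and therefore at file page 12.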
-
Satyam Sharma authored
kfree(NULL) would normally occur only in error paths, and kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the condition check in SLUB's and SLOB's kfree() to optimize for the common case. SLAB has this already. Signed-off-by: Satyam Sharma <satyam@infradead.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
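A user-space sketch of the pattern follows. It assumes GCC or Clang, since it relies on `__builtin_expect`; the `sketch_` names and the sentinel value are illustrative stand-ins rather than the kernel's definitions, and `free()` replaces the slab free path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Branch hint: tell the compiler the condition is expected to be false. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Illustrative sentinel for zero-sized allocations; never dereferenced. */
#define SKETCH_ZERO_SIZE_PTR ((void *)16)

/* One comparison covers both NULL and the zero-size sentinel. */
#define SKETCH_ZERO_OR_NULL_PTR(p) \
    ((uintptr_t)(p) <= (uintptr_t)SKETCH_ZERO_SIZE_PTR)

/* Returns 1 if memory was released, 0 if the pointer was NULL or the sentinel. */
static int sketch_kfree(void *p)
{
    if (sketch_unlikely(SKETCH_ZERO_OR_NULL_PTR(p)))
        return 0; /* rare path: nothing to free */
    free(p);      /* common path, laid out as the fall-through */
    return 1;
}
```

The hint does not change behaviour; it only lets the compiler arrange the code so that the common case is the straight-line path.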
-
Martin Schwidefsky authored
Move the definitions of struct mm_struct and struct vm_area_struct to include/linux/mm_types.h. This makes it possible to define more functions in asm/pgtable.h and friends with inline assemblies instead of macros. Compile tested on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64. [aurelien@aurel32.net: build fix] Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
Rather than sign direct radix-tree pointers with a special bit, sign the indirect one that hangs off the root. This means that, given a lookup_slot operation, the invalid result will be differentiated from the valid (previously, valid results could have the bit either set or clear). This does not affect slot lookups which occur under lock -- they can never return an invalid result. This is needed in future for the lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
__add_to_swap_cache unconditionally sets the page locked, which can be a bit alarming to the unsuspecting reader: in the code paths where the page is visible to other CPUs, the page should be (and is) already locked. Instead, just add a check to ensure the page is locked here, and teach the one path relying on the old behaviour to call SetPageLocked itself. [hugh@veritas.com: locking fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
find_lock_page does not need to recheck ->index because if the page is in the right mapping then the index must be the same. Also, tree_lock does not need to be retaken after the page is locked in order to test that ->mapping has not changed, because holding the page lock pins its mapping. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
Probing pages and radix_tree_tagged are lockless operations with the lockless radix-tree. Convert these users to RCU locking rather than using tree_lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
The commit b5810039 contains the note

    A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue.

And indeed this cacheline bouncing has shown up on large SGI systems. There was a situation where an Altix system was essentially livelocked tearing down ZERO_PAGE pagetables when an HPC app aborted during startup. This situation can be avoided in userspace, but it does highlight the potential scalability problem with refcounting ZERO_PAGE, and corner cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:

1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they result in more complex code than does 3, with little or no real benefit that I can see.

Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used).

As a sanity check -- measuring on my desktop system, there are never many mappings to the ZERO_PAGE (e.g. 2 or 3), thus memory usage here should not increase much without it.

When running a make -j4 kernel compile on my dual core system, there are about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000 ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second is torn down without being COWed).
So removing ZERO_PAGE will save 1,000 page faults per second when running kbuild, while keeping it only saves less than 1 page clearing operation per second. 1 page clear is cheaper than a thousand faults, presumably, so there isn't an obvious loss. Neither the logical argument nor these basic tests give a guarantee of no regressions. However, this is a reasonable opportunity to try to remove the ZERO_PAGE from the pagefault path. If it is found to cause regressions, we can reintroduce it and just avoid refcounting it. The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see much use to them except on benchmarks. All other users of ZERO_PAGE are converted just to use ZERO_PAGE(0) for simplicity. We can look at replacing them all and maybe ripping out ZERO_PAGE completely when we are more satisfied with this solution. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
This gets rid of all kmalloc caches larger than page size. A kmalloc request larger than PAGE_SIZE / 2 is going to be passed through to the page allocator. This works both inline, where we will call __get_free_pages instead of kmem_cache_alloc, and in __kmalloc. kfree is modified to check if the object is in a slab page. If not, then the page is freed via the page allocator instead. Roughly similar to what SLOB does.

Advantages:
- Reduces memory overhead for the kmalloc array.
- Large kmalloc operations are faster since they do not need to pass through the slab allocator to get to the page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for PAGE_SIZEd allocations. SLUB must call the page allocator for each alloc anyway, since the higher-order pages that allowed avoiding the page allocator calls are no longer reliably available. So we are basically removing useless slab allocator overhead.
- Large kmallocs yield page-aligned objects, which is what SLAB did. Bad things like using page-sized kmalloc allocations to stand in for page allocator allocs can be transparently handled and are not distinguishable from page allocator uses.
- Checking for too-large objects can be removed since it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore.
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
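The size-based dispatch described above can be sketched in user space. This is an illustration under stated assumptions: a 4KB page, a threshold of half a page, and hypothetical names (`choose_path`, `sketch_get_order`) that only mirror the idea of routing large requests to a page allocator that hands out power-of-two page blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 4KB pages for the example. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

enum alloc_path {
    PATH_SLAB,           /* served from a kmalloc slab cache */
    PATH_PAGE_ALLOCATOR  /* passed straight to the page allocator */
};

/* Requests above half a page bypass the slab caches. */
static enum alloc_path choose_path(size_t size)
{
    return size > SKETCH_PAGE_SIZE / 2 ? PATH_PAGE_ALLOCATOR : PATH_SLAB;
}

/* Smallest order such that (PAGE_SIZE << order) covers the request. */
static unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;

    assert(size > 0);
    while ((SKETCH_PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

With 4KB pages, a 2048-byte request still goes to a slab cache, a 2049-byte request takes a whole page (order 0), and a 16385-byte request needs an order-3 block of eight pages.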
-
Fengguang Wu authored
Convert some 'unsigned long' to pgoff_t. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
- remove unused local next_index in do_generic_mapping_read()
- remove a redundant page_cache_read() declaration

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
Remove the size limit max_sectors_kb imposed on max_readahead_kb. The size restriction is unreasonable, especially when max_sectors_kb cannot grow larger than max_hw_sectors_kb, which can be rather small for some disk drives. Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
The local copy of ra in do_generic_mapping_read() can now go away. It predates readahead(req_size), from a time when the readahead code was called on *every* single page. Hence a local copy had to be made to reduce the chance of the readahead state being overwritten by a concurrent reader. More details in: Linux: Random File I/O Regressions In 2.6 <http://kerneltrap.org/node/3039> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
This is a simplified version of the pagecache context based readahead. It handles the case of multiple threads reading on the same fd and invalidating each others' readahead state. It does the trick by scanning the pagecache and recovering the current read stream's readahead status.

The algorithm works in an opportunistic way, in that it does not try to detect interleaved reads _actively_, which requires a probe into the page cache (which means a little more overhead for random reads). It only tries to handle a previously started sequential readahead whose state was overwritten by another concurrent stream, and it can do this job pretty well.

Negative and positive examples (or what you can expect from it):

1) it cannot detect and serve perfect request-by-request interleaved reads right:

    time    stream 1    stream 2
    0       1
    1                   1001
    2       2
    3                   1002
    4       3
    5                   1003
    6       4
    7                   1004
    8       5
    9                   1005

Here no single readahead will be carried out.

2) However, if it's two concurrent reads by two threads, the chance of the initial sequential readahead being started is huge. Once the first sequential readahead is started for a stream, this patch will ensure that the readahead window continues to ramp up and won't be disturbed by other streams.

    time    stream 1    stream 2
    0       1
    1       2
    2                   1001
    3       3
    4                   1002
    5                   1003
    6       4
    7       5
    8                   1004
    9       6
    10                  1005
    11      7
    12                  1006
    13                  1007

Here stream 1 will start a readahead at page 2, and stream 2 will start its first readahead at page 1003. From then on the two streams will be served right.

Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for the first hole. It will be used in interleaved readahead. The implementation is dumb and obviously correct. It can help debug(and document) the possible smart one in future. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
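A deliberately simple linear scan of this kind might look as follows in user space. The fixed-size `struct sketch_tree` and `sketch_lookup()` are hypothetical stand-ins for the radix tree and its lookup; only the (root, index, max_scan) shape of the interface comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_SLOTS 64UL

/* A flat presence map stands in for the radix tree. */
struct sketch_tree {
    bool present[SKETCH_SLOTS];
};

/* True if an item is stored at index; out-of-range indices count as holes. */
static bool sketch_lookup(const struct sketch_tree *t, unsigned long index)
{
    return index < SKETCH_SLOTS && t->present[index];
}

/*
 * Scan upward from index, examining at most max_scan slots.
 * Returns the index of the first hole found, or the index at which
 * scanning stopped if no hole was seen within max_scan slots.
 */
static unsigned long sketch_next_hole(const struct sketch_tree *t,
                                      unsigned long index,
                                      unsigned long max_scan)
{
    unsigned long i;

    for (i = 0; i < max_scan; i++) {
        if (!sketch_lookup(t, index))
            break;
        index++;
        if (index == 0) /* wrapped around the index space */
            break;
    }
    return index;
}
```

Because the scan is bounded, a caller cannot tell a real hole from an exhausted scan by the return value alone; it has to compare the result against the starting index plus max_scan.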
-
Fengguang Wu authored
Combine the file_ra_state members

    unsigned long prev_index
    unsigned int prev_offset

into

    loff_t prev_pos

It is more consistent and better supports huge files. Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
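The packing can be sketched as below. The helper names and `sketch_loff_t` are hypothetical, a 4KB page cache page is assumed, and the widening cast before the shift is one plausible reading of the "fix shift overflow" note rather than a quote from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4KB page cache pages for the example. */
#define SKETCH_PAGE_CACHE_SHIFT 12
#define SKETCH_PAGE_CACHE_SIZE  (1UL << SKETCH_PAGE_CACHE_SHIFT)

typedef int64_t sketch_loff_t; /* stand-in for the kernel's loff_t */

/* Combine a page index and an in-page offset into one byte position. */
static sketch_loff_t make_prev_pos(unsigned long prev_index, unsigned int prev_offset)
{
    assert(prev_offset < SKETCH_PAGE_CACHE_SIZE);
    /* Widen first: shifting a 32-bit index would drop the high bits. */
    return ((sketch_loff_t)prev_index << SKETCH_PAGE_CACHE_SHIFT) | prev_offset;
}

/* Recover the page index from a non-negative byte position. */
static unsigned long pos_to_index(sketch_loff_t pos)
{
    return (unsigned long)(pos >> SKETCH_PAGE_CACHE_SHIFT);
}

/* Recover the in-page offset from a byte position. */
static unsigned int pos_to_offset(sketch_loff_t pos)
{
    return (unsigned int)(pos & (sketch_loff_t)(SKETCH_PAGE_CACHE_SIZE - 1));
}
```

Page index 0x200000 with offset 5 corresponds to byte position 2^33 + 5, which no longer fits in 32 bits; that is the case where shifting before widening would go wrong on a machine with a 32-bit long.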
-
Fengguang Wu authored
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Fengguang Wu authored
Use 'unsigned int' instead of 'unsigned long' for readahead sizes. This helps reduce memory consumption on 64bit CPU when a lot of files are opened. CC: Andi Kleen <andi@firstfloor.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jesper Juhl authored
This patch cleans up duplicate includes in mm/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jesper Juhl authored
This patch cleans up duplicate includes in include/linux/memory_hotplug.h Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Will Schmidt authored
We have had complaints where a threaded application is left in a bad state after one of its threads is killed when we hit a VM: out_of_memory condition. Killing just one of the process threads can leave the application in a bad state, whereas killing the entire process group would allow for the application to restart, or be otherwise handled, and makes it very obvious that something has gone wrong. This change allows the entire process group to be taken down, rather than just the one thread. Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Adrian Bunk authored
WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes') ... Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andy Whitcroft authored
Enable virtual memmap support for SPARSEMEM on PPC64 systems. Slice a 16th off the end of the linear mapping space and use that to hold the vmemmap. Uses the same size mapping as is used in the linear 1:1 kernel mapping. [pbadari@gmail.com: fix warning] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Miller authored
[apw@shadowen.org: style fixups] [apw@shadowen.org: vmemmap sparc64: convert to new config options] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
Equip IA64 sparsemem with a virtual memmap. This is similar to the existing CONFIG_VIRTUAL_MEM_MAP functionality for DISCONTIGMEM. It uses a PAGE_SIZE mapping. This is provided as a minimally intrusive solution. We split the 128TB VMALLOC area into two 64TB areas and use one for the virtual memmap. This should replace CONFIG_VIRTUAL_MEM_MAP long term. [apw@shadowen.org: convert to new helper based initialisation] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
x86_64 uses 2M page table entries to map its 1-1 kernel space. We also implement the virtual memmap using 2M page table entries. So there is no additional runtime overhead over FLATMEM, although initialisation is slightly more complex. As FLATMEM still references memory to obtain the mem_map pointer and SPARSEMEM_VMEMMAP uses a compile time constant, SPARSEMEM_VMEMMAP should be superior. With this SPARSEMEM becomes the most efficient way of handling virt_to_page, pfn_to_page and friends for UP, SMP and NUMA on x86_64. [apw@shadowen.org: code resplit, style fixups] [apw@shadowen.org: vmemmap x86_64: ensure end of section memmap is initialised] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andy Whitcroft authored
Convert the common vmemmap population into initialisation helpers for use by architecture vmemmap populators. All architectures implementing the SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate() initialiser, which may make use of the helpers. This allows us to clean up and remove the initialisation Kconfig entries. With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to indicate use of that variant. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christoph Lameter authored
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contiguous area and only the active sections are physically backed. This allows virt_to_page, page_address and cohorts to become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn) (vmemmap + (pfn))
    #define __page_to_pfn(page) ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This will make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (e.g. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSEMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms.
Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. [apw@shadowen.org: config cleanups, resplit code etc] [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init] [apw@shadowen.org: vmemmap: remove excess debugging] [apw@shadowen.org: simplify initialisation code and reduce duplication] [apw@shadowen.org: pull out the vmemmap code into its own file] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
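The pointer arithmetic behind the two macros can be tried out in user space. In this sketch a static array stands in for the virtually mapped memmap; `struct sketch_page`, `SKETCH_MAX_PFN` and the `sketch_` helpers are hypothetical names, and in the kernel vmemmap would be a fixed virtual address rather than an array.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct page. */
struct sketch_page {
    unsigned long flags;
};

#define SKETCH_MAX_PFN 1024

/* The array plays the role of the virtually contiguous memmap. */
static struct sketch_page sketch_memmap[SKETCH_MAX_PFN];
#define sketch_vmemmap (sketch_memmap)

/* pfn -> page: a single add on a constant base, no table lookup. */
static struct sketch_page *sketch_pfn_to_page(unsigned long pfn)
{
    assert(pfn < SKETCH_MAX_PFN);
    return sketch_vmemmap + pfn;
}

/* page -> pfn: a single pointer subtraction from the same base. */
static unsigned long sketch_page_to_pfn(const struct sketch_page *page)
{
    return (unsigned long)(page - sketch_vmemmap);
}
```

The two conversions are exact inverses, and neither reads memory to find the base, which is the property the commit message contrasts with FLATMEM's mem_map variable.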
-
Andy Whitcroft authored
We have a flag to indicate whether a section actually has a valid mem_map associated with it. This flag is never set, and we rely solely on the present bit to indicate that a section is valid. By definition a section is not valid if it has no mem_map, and there is a window during init where the present bit is set but there is no mem_map, during which pfn_valid() will incorrectly return true.

Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add new present_section{,_nr} and pfn_present() interfaces for those users who care to know that a section is going to be valid.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
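
The distinction between "present" and "valid" can be sketched in user-space C as follows. This is a simplified illustration, not the patch itself: `struct mem_section` is reduced to a single word, the bit values and the name `SECTION_MARKED_PRESENT` for the present bit are assumptions made for this sketch, and `sketch_init_window` is a hypothetical helper that walks one section through the init window the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bits kept in the section word (bit positions are assumptions). */
#define SECTION_MARKED_PRESENT (1UL << 0)
#define SECTION_HAS_MEM_MAP    (1UL << 1)

/* Hypothetical, stripped-down section descriptor. */
struct mem_section {
    unsigned long section_mem_map;
};

/* Present: memory is known to exist; the section will become valid. */
static inline bool present_section(const struct mem_section *section)
{
    return section && (section->section_mem_map & SECTION_MARKED_PRESENT);
}

/* Valid: a mem_map is actually attached, so struct pages can be used. */
static inline bool valid_section(const struct mem_section *section)
{
    return section && (section->section_mem_map & SECTION_HAS_MEM_MAP);
}

/* Walk one section through the init window described above. */
static inline void sketch_init_window(void)
{
    struct mem_section section = { 0 };

    /* Early init: the section is marked present, no mem_map yet. */
    section.section_mem_map |= SECTION_MARKED_PRESENT;
    assert(present_section(&section));
    assert(!valid_section(&section));   /* must not be reported valid */

    /* Later: the mem_map is attached and the section becomes valid. */
    section.section_mem_map |= SECTION_HAS_MEM_MAP;
    assert(valid_section(&section));

    /* A missing section is neither present nor valid. */
    assert(!present_section(NULL) && !valid_section(NULL));
}
```

With validity keyed to the mem_map bit, a pfn_valid()-style check built on `valid_section()` returns false during the window, while callers that only need to know the section will exist can use `present_section()`.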
-
Andy Whitcroft authored
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far, what has hindered this is the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline.

This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contiguous area and only the active sections are physically backed. This allows virt_to_page, page_address and cohorts to become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required.

The two key operations pfn_to_page and page_to_pfn become:

    #define __pfn_to_page(pfn)      (vmemmap + (pfn))
    #define __page_to_pfn(page)     ((page) - vmemmap)

By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This will make a virtual memmap difficult to use on 32-bit platforms that support 36 address bits.

However, if there is enough virtual space available and the arch already maps its 1:1 kernel space using TLBs (e.g. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset.

This patch has the potential to allow us to make SPARSEMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc.

The current aim is to bring a common virtually mapped mem_map to all architectures. This should facilitate the removal of the bespoke implementations from the architectures. This also brings performance improvements for most architectures, making SPARSEMEM vmemmap the more desirable memory model. The ultimate aim of this work is to expand SPARSEMEM support to encompass all the features of the other memory models. This could allow us to drop support for and remove the other models in the longer term.

Below are some comparative kernbench numbers for various architectures, comparing the default memory model against SPARSEMEM VMEMMAP. All but ia64 show marginal improvement; we expect the ia64 figures to be sorted out when the larger mapping support returns.

x86-64 non-NUMA
             Base    VMEMMAP   % change (-ve good)
    User     85.07     84.84   -0.26
    System   34.32     33.84   -1.39
    Total   119.38    118.68   -0.59

ia64
             Base    VMEMMAP   % change (-ve good)
    User   1016.41   1016.93    0.05
    System   50.83     51.02    0.36
    Total  1067.25   1067.95    0.07

x86-64 NUMA
             Base    VMEMMAP   % change (-ve good)
    User    430.77    431.73    0.22
    System   45.39     43.98   -3.11
    Total   476.17    475.71   -0.10

ppc64
             Base    VMEMMAP   % change (-ve good)
    User    488.77    488.35   -0.09
    System   56.92     56.37   -0.97
    Total   545.69    544.72   -0.18

Below are some AIM benchmarks on IA64 and x86-64 (thanks Bob). These seem pretty much flat, as you would expect.

ia64 results: 2 CPU, non-NUMA, 4 GB, SCSI disk

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    extreme  Jun 1 07:17:24 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       98.9   100     58.9      1.3   1.6482
  101     5547.1    95    106.0     79.4   0.9154
  201     6377.7    95    183.4    158.3   0.5288
  301     6932.2    95    252.7    237.3   0.3838
  401     7075.8    93    329.8    316.7   0.2941
  501     7235.6    94    403.0    396.2   0.2407
  600     7387.5    94    472.7    475.0   0.2052

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    vmemmap  Jun 1 09:59:04 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       99.1   100     58.8      1.2   1.6509
  101     5480.9    95    107.2     79.2   0.9044
  201     6490.3    95    180.2    157.8   0.5382
  301     6886.6    94    254.4    236.8   0.3813
  401     7078.2    94    329.7    316.0   0.2942
  501     7250.3    95    402.2    395.4   0.2412
  600     7399.1    94    471.9    473.9   0.2055

OpenPower 710: 2 CPU, 4 GB, SCSI and configured physically

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    extreme  May 29 15:42:53 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       25.7   100    226.3      4.3   0.4286
  101     1096.0    97    536.4    199.8   0.1809
  201     1236.4    96    946.1    389.1   0.1025
  301     1280.5    96   1368.0    582.3   0.0709
  401     1270.2    95   1837.4    771.0   0.0528
  501     1251.4    96   2330.1    955.9   0.0416
  601     1252.6    96   2792.4   1139.2   0.0347
  701     1245.2    96   3276.5   1334.6   0.0296
  918     1229.5    96   4345.4   1728.7   0.0223

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    vmemmap  May 30 07:28:26 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       25.6   100    226.9      4.3   0.4275
  101     1049.3    97    560.2    198.1   0.1731
  201     1199.1    97    975.6    390.7   0.0994
  301     1261.7    96   1388.5    591.5   0.0699
  401     1256.1    96   1858.1    771.9   0.0522
  501     1220.1    96   2389.7    955.3   0.0406
  601     1224.6    96   2856.3   1133.4   0.0340
  701     1252.0    96   3258.7   1314.1   0.0298
  915     1232.8    96   4319.7   1704.0   0.0225

amd64: 2 x 2-core, 4 GB and SATA

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    extreme  Jun 2 03:59:48 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       13.0   100    446.4      2.1   0.2173
  101      533.4    97   1102.0    110.2   0.0880
  201      578.3    97   2022.8    220.8   0.0480
  301      583.8    97   3000.6    332.3   0.0323
  401      580.5    97   4020.1    442.2   0.0241
  501      574.8    98   5072.8    558.8   0.0191
  600      566.5    98   6163.8    671.0   0.0157

Benchmark                               Version  Machine  Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"    vmemmap  Jun 3 04:19:31 2007

Tasks   Jobs/Min   JTI     Real      CPU   Jobs/sec/task
    1       13.0   100    447.8      2.0   0.2166
  101      536.5    97   1095.6    109.7   0.0885
  201      567.7    97   2060.5    219.3   0.0471
  301      582.1    96   3009.4    330.2   0.0322
  401      578.2    96   4036.4    442.4   0.0240
  501      585.1    98   4983.2    555.1   0.0195
  600      565.5    98   6175.2    660.6   0.0157

This patch:

Fix some spelling errors.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-