- 22 Nov, 2002 32 commits
-
-
Andrew Morton authored
Implements a new set of block address_space_operations which will never attach buffer_heads to file pagecache. These can be turned on for ext2 with the `nobh' mount option. During write-intensive testing on a 7G machine, total buffer_head storage remained below 0.3 megabytes. And those buffer_heads are against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL memory pressure. This work is, of course, a special for the huge highmem machines. Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't work terribly well), but that code is simple, and will provide relief for other filesystems. It should be noted that the nobh_prepare_write() function and the PageMappedToDisk() infrastructure is what is needed to solve the problem of user data corruption when the filesystem which backs a sparse MAP_SHARED mapping runs out of space. We can use this code in filemap_nopage() to ensure that all mapped pages have space allocated on-disk. Deliver SIGBUS on ENOSPC. This will require a new address_space op, I expect.
-
Andrew Morton authored
This patch is a general solution to the situation where a zone is full of pinned pages. This can come about if: a) Someone has allocated all of ZONE_DMA for IO buffers b) Some application is mlocking some memory and a zone ends up full of mlocked pages (can happen on a 1G ia32 system) c) All of ZONE_HIGHMEM is pinned in hugetlb pages (can happen on 1G machines) We'll currently burn 10% of CPU in kswapd when this happens, although it is quite hard to trigger. The algorithm is: - If page reclaim has scanned 2 * the total number of pages in the zone and there have been no pages freed in that zone then mark the zone as "all unreclaimable". - When a zone is "all unreclaimable" page reclaim almost ignores it. We will perform a "light" scan at DEF_PRIORITY (typically 1/4096'th of the zone, or 64 pages) and then forget about the zone. - When a batch of pages are freed into the zone, clear its "all unreclaimable" state and start full scanning again. The assumption being that some state change has come about which will make reclaim successful again. So if a "light scan" actually frees some pages, the zone will revert to normal state immediately. So we're effectively putting the zone into "low power" mode, and lightly polling it to see if something has changed. The code works OK, but is quite hard to test - I mainly tested it by pinning all highmem in hugetlb pages.
-
Andrew Morton authored
Strengthen the `incremental min' logic in the page allocator. Currently it is allowing the allocation to succeed if the zone has free_pages >= pages_high. This was to avoid a lockup corner case in which all the zones were at pages_high so reclaim wasn't doing anything, but the incremental min refused to take pages from those zones anyway. But we want the incremental min zone protection to work. So: - Only allow the allocator to dip below the incremental min if he cannot run direct reclaim. - Change the page reclaim code so that on the direct reclaim path, the caller can free pages beyond ->pages_high. So if the incremental min test fails, the caller will go and free some more memory. Eventually, the caller will have freed enough memory for the incremental min test to pass against one of the zones.
-
Andrew Morton authored
The vm_writeback address_space operation was designed to provide the VM with a "clustered writeout" capability. It allowed the filesystem to perform more intelligent writearound decisions when the VM was trying to clean a particular page. I can't say I ever saw any real benefit from this - not much writeout actually happens on that path - quite a lot of work has gone into minimising it actually. The default ->vm_writeback a_op which I provided wrote back the pages in ->dirty_pages order. But there is one scenario in which this causes problems - writing a single 4G file with mem=4G. We end up with all of ZONE_NORMAL full of dirty pages, but all writeback effort is against highmem pages. (Because there is about 1.5G of dirty memory total). Net effect: the machine stalls ZONE_NORMAL allocation attempts until the ->dirty_pages writeback advances onto ZONE_NORMAL pages. This can be fixed most sweetly with additional radix-tree infrastructure which will be quite complex. Later. So this patch dumps it all, and goes back to using writepage against individual pages as they come off the LRU.
-
Andrew Morton authored
blk_congestion_wait() is a utility function which various callers use to throttle themselves to the rate at which the IO system can retire writes. The current implementation refuses to wait if no queues are "congested" (>75% of requests are in flight). That doesn't work if the queue is so huge that it can hold more than 40% (dirty_ratio) of memory. The queue simply cannot enter congestion because the VM refuses to allow more than 40% of memory to be dirtied. (This spin could happen with a lot of normal-sized queues too) So this patch simply changes blk_congestion_wait() to throttle even if there are no congested queues. It will cause the caller to sleep until someone puts back a write request against any queue. (Nobody uses blk_congestion_wait for read congestion). The patch adds new state to backing_dev_info->state: a couple of flags which indicate whether there are _any_ reads or writes in flight against that queue. This was added to prevent blk_congestion_wait() from taking a nap when there are no writes at all in flight. But the "are there any reads" info could be used to defer background writeout from pdflush, to reduce read-vs-write competition. We'll see. Because the large request queues have made a fundamental change: blocking in get_request_wait() has been the main form of VM throttling for years. But with large queues it doesn't work any more - all throttling happens in blk_congestion_wait(). Also, change io_schedule_timeout() to propagate the schedule_timeout() return value. I was using that in some debug code, but it should have been like that from day one.
-
Andrew Morton authored
From Roman Zippel. Don't assume that physical memory starts at physical address zero.
-
Andrew Morton authored
Patch from Stephen Tweedie "In looking at the fix for the ext3 Orlov double-accounting bug, I noticed a change to the sb->s_dir_count accounting, restoring a missing s_dir_count++ when we allocate a new directory. However, I can't find anywhere in the code where we decrement this again on directory deletion, neither in ext2 nor in ext3, in 2.4 nor in 2.5." Locking is via lock_super().
-
Andrew Morton authored
There is a warning in there to detect when block_write_full_page() attaches buffers to a blockdev page. This is a bad thing because that page's blocks may then overlap blocks from a different address_space. So I disallowed it. But the message can be triggered when an application is mmapping a blockdev MAP_SHARED. Apparently INND likes to do this. So remove the warning.
-
Andrew Morton authored
Patch from Christopher Li <chrisl@vmware.com> This little patch will fix two place in htree code which forget the "cpu_to_le16" converting . This bug causes incorrect record length on PPC. Thanks Franz for report the problem.
-
Andrew Morton authored
Patch from Andreas Gruenbacher <agruen@suse.de> The setxattr inode operation is defined like this in 2.4 and 2.5: int (*setxattr) (struct dentry *dentry, const char *name, void *value, size_t size, int flags); the original type of the value parameter was `const void *'; the const obviously has been lost at some point. The definition should be: int (*setxattr) (struct dentry *dentry, const char *name, const void *value, size_t size, int flags);
-
Andrew Morton authored
The page allocator has traditionally just gone BUG when it sees a page in a bad state. This is usually due to hardware errors, sometimes software errors. I'm proposing that we not go BUG() any more, but print lots (and lots) of diagnostic info and try to continue. Might be a bit controversial.
-
Andrew Morton authored
balance_dirty_pages() is too expensive to call once-per-page. Use the ratelimited version.
-
Andrew Morton authored
From Dipanker Sarma. Before setting the ids->entries to the new array, there must be a wmb() to make sure that the memcpyed contents of the new array are visible before the new array becomes visible.
-
Andrew Morton authored
This patch fixes a problem which was discovered by Vladimir Saveliev <vs@namesys.com> Radix trees have a `height' field, which defines how far the pages are from the root of the tree. It starts out at zero and increases as the trees depth is grown. But it is never decreased. It cannot be decreased without a full tree traversal. Because radix_tree_delete() does not decrease `height', we end up returning inodes to their filesystem's inode slab cache with a non-zero height. And when that inode is reused from slab for a new file, it still has a non-zero height. So we're breaking the slab rules by not putting objects back in a fully reinitialised state. So the new file starts out life with whatever height the previous owner of the inode had. Which is space- and speed-inefficient. The most efficient place to fix this would be in destroy_inode(). But that only fixes the problem for inodes - there are other users of radix trees. So fix it in radix_tree_delete(): if the tree was emptied, reset `height' to zero.
-
Andrew Morton authored
Patch from Hugh Dickins <hugh@veritas.com> Fixes the Oracle startup problem reported by Alessandro Suardi. Reverts a "simplification" to shmdt() which was wrong if subsequent mprotects broke up the original VMA, or if parts of it were munmapped.
-
Andrew Morton authored
- I hit a BUG in end_swap_bio_read() under heavy load. The page wasn't locked. No idea how this can happen :( Add a BUG at submission time to catch a caller reading into an unlocked swapcache page. - Remove a debug check from destroy_inode() - it was in the wrong leg of the `if' statement anyway.
-
Neil Brown authored
This allows NFSv4 responses to cover move than one page. There are still limits though. There can be at most one 'data' response which includes READ, READLINK, READDIR. For these responses, the interesting data goes in a separate page or, for READ, list of pages. All responses before the 'data' response must fit in one page, and all responses after it must also fit in one (separate) page.
-
Neil Brown authored
Now that nfsd uses a list of pages for requests instead of one large buffer, NFSv4 need to know about this. The most interesting part of this is that it is possible that section of a request, like a path name, could span two pages, so we need to be able to kmalloc as little bit of space to copy them into, and make sure they get freed later.
-
Anton Blanchard authored
We don't implement the ancient stat syscalls on ppc64 since early libcs wont run on ppc64 (they hardcode the incorrect cacheline size).
-
bk://are.twiddle.net/axp-2.5Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
http://linux-acpi.bkbits.net/linux-acpiLinus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Trond Myklebust authored
Here is the a pre-patch in the attempt to get rid of 'struct nfs4_compound', and the associated horrible union in 'struct nfs4_op'. It splits out the fields that are meant to do buffer overflow checking and iovec adjusting on the XDR received/sent data. It moves support for that nto the dedicated structure 'xdr_stream', and the associated functions 'xdr_reserve_space()', 'xdr_inline_decode()'. The patch also expands out the all macros ENCODE_HEAD, ENCODE_TAIL, ADJUST_ARGS and DECODE_HEAD, as well as most of the DECODE_TAILs.
-
Tom Rini authored
linux/interrupt.h needs: asm/system.h: smb_mb() linux/linkage.h: asmlinkage/FASTCALL/etc.
-
Dominik Brodowski authored
This changes the return type of the verify and setpolicy functions from void to int. While doing this, I've changed the values for minimum and maximum supported frequency to be per CPU, as UltraSPARC needs this. Additionally, small cleanups in various drivers.
-
Ivan Kokshaysky authored
Traditional naming in pci/setup-xx code assumes that pdev_*/pbus_* functions are private, everything visible from outer world should be pci_*.
-
Stelian Pop authored
The most important changes are: - allocate buffers on open(), not module load; - correct some failed allocation paths; - use wait_event; - C99 structs inits;
-
Stelian Pop authored
The most important changes are: * add suspend/resume support to the sonypi driver (not based on driverfs however) (Florian Lohoff); * add "Zoom" and "Thumbphrase" buttons (Francois Gurin); * add camera and lid events for C1XE (Kunihiko IMAI); * add a mask parameter letting the user choose what kind of events he wants; * use ACPI ec_read/ec_write when available in order to play nice when latest ACPI is enabled; * several source cleanups.
-
bk://linux-dj.bkbits.net/watchdogLinus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Dave Jones authored
-
Andries E. Brouwer authored
Today we return EINVAL for fcntl with a lock with negative length. POSIX-2001 says that the lock covers start .. start+len-1 if len >= 0 and start+len .. start-1 if len < 0.
-
Robert Read authored
-
Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
- 21 Nov, 2002 8 commits
-
-
Bart De Schuymer authored
-
Hideaki Yoshifuji authored
-
David S. Miller authored
into nuts.ninka.net:/home/davem/src/BK/net-2.5
-
David S. Miller authored
-
David S. Miller authored
-
David S. Miller authored
-
Richard Henderson authored
-
Andy Grover authored
into groveronline.com:/root/bk/linux-acpi
-