- 10 Sep, 2002 16 commits
-
-
Andrew Morton authored
The pte_chains presently consist of a pte pointer and a `next' link. So there's a 50% memory wastage here, as well as potential for a lot of cache misses during walks of the singly-linked per-page list.

This patch increases the pte_chain structure to occupy a full cacheline. There are 7, 15 or 31 pte pointers per structure rather than just one. So the wastage falls to a few percent and the number of misses during the walk is reduced.

The patch doesn't make much difference in simple testing, because in those tests the pte_chain list from the previous page has good cache locality with the next page's list. The patch sped up Anton's "10,000 concurrently exiting shells" test by 3x or 4x. It gives a 10% reduction in system time for a kernel build on 16p NUMAQ. It saves memory and reduces the amount of work performed in the slab allocator.

Pages which are mapped by only a single process continue to not have a pte_chain. The pointer in struct page points directly at the mapping pte (a "PageDirect" pte pointer). Once the page is shared a pte_chain is allocated and both the new and old pte pointers are moved into it.

We used to collapse the pte_chain back to a PageDirect representation in page_remove_rmap(). That has been changed. That collapse is now performed inside page reclaim, via page_referenced(). The thinking here is that if a page was previously shared then it may become shared again, so leave the pte_chain structure in place. But if the system is under memory pressure then start reaping them anyway.
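The 7, 15 or 31 figures are consistent with reserving one pointer-sized slot for the `next' link and filling the rest of the cacheline with pte pointers. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the 32/64/128-byte cacheline sizes and 4-byte pointers used in the checks are assumptions, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of pte pointers that fit in one cacheline
 * once a single pointer-sized slot is reserved for the `next' link. */
static size_t ptes_per_chain(size_t cacheline_bytes, size_t pointer_bytes)
{
    assert(pointer_bytes > 0 && cacheline_bytes >= 2 * pointer_bytes);
    return cacheline_bytes / pointer_bytes - 1;
}

/* Share of the structure spent on the link instead of pte pointers,
 * expressed in tenths of a percent (integer division truncates). */
static unsigned link_overhead_permille(size_t struct_bytes, size_t pointer_bytes)
{
    assert(struct_bytes >= pointer_bytes && struct_bytes > 0);
    return (unsigned)(1000 * pointer_bytes / struct_bytes);
}
```

With the old layout (one pte pointer plus one link) the overhead is 500 permille, the 50% mentioned above; with a 128-byte structure it drops to roughly 3%.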
-
Andrew Morton authored
This patch addresses the excessive consumption of ZONE_NORMAL by buffer_heads on highmem machines. The algorithms which decide which buffers to shoot down are fairly dumb, but they only cut in on machines with large highmem:lowmem ratios and the code footprint is tiny.

The buffer.c change implements the buffer_head accounting - it sets the upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.

A possible side-effect of this change is that the kernel will perform more calls to get_block() to map pages to disk. This will only be observed when a file is being repeatedly overwritten - this is the only case in which the "cached get_block result" in the buffers is useful. I did quite some testing of this back in the delalloc ext2 days, and was not able to come up with a test in which the cached get_block result was measurably useful. That's for ext2, which has a fast get_block().

A desirable side effect of this patch is that the kernel will be able to cache much more blockdev pagecache in ZONE_NORMAL, so there are more ext2/3 indirect blocks in cache, so with some workloads, less I/O will be performed.

In mpage_writepage(): if the number of buffer_heads is excessive then buffers are stripped from pages as they are submitted for writeback. This change is only useful for filesystems which are using the mpage code. That's ext2 and ext3-writeback and JFS. An mpage patch for reiserfs was floating about but seems to have got lost. There is no need to strip buffers for reads because the mpage code does not attach buffers for reads. These are perhaps not the most appropriate buffer_heads to toss away. Perhaps something smarter should be done to detect file overwriting, or to toss the 'oldest' buffer_heads first.

In refill_inactive(): if the number of buffer_heads is excessive then strip buffers from pages as they move onto the inactive list. This change is useful for all filesystems. This approach is good because pages which are being repeatedly overwritten will remain on the active list and will retain their buffers, whereas pages which are not being overwritten will be stripped.
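The 10% accounting limit can be pictured as a simple budget calculation. The sketch below is an illustration under stated assumptions, not the buffer.c code: the function names, the 4096-byte page size and the 64-byte buffer_head size used in the checks are all hypothetical.

```c
#include <assert.h>

/* Hypothetical: how many buffer_heads fit in 10% of ZONE_NORMAL.
 * zone_normal_pages, page_size and bh_size are caller-supplied assumptions. */
static unsigned long max_buffer_heads(unsigned long zone_normal_pages,
                                      unsigned long page_size,
                                      unsigned long bh_size)
{
    assert(bh_size > 0 && bh_size <= page_size);
    unsigned long budget_pages = zone_normal_pages / 10;   /* 10% of lowmem */
    return budget_pages * (page_size / bh_size);
}

/* Hypothetical predicate that writeback and reclaim paths could consult
 * before deciding to strip buffers from a page. */
static int buffer_heads_over_limit(unsigned long nr_buffer_heads,
                                   unsigned long limit)
{
    return nr_buffer_heads > limit;
}
```

For an 896MB ZONE_NORMAL (229376 pages of 4096 bytes) and an assumed 64-byte buffer_head, the budget comes to 1467968 buffer_heads.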
-
Andrew Morton authored
Writeback parameter tuning. Somewhat experimental, but heading in the right direction, I hope.

- Allowing 40% of physical memory to be dirtied on massive ia32 boxes is unreasonable. It pins too many buffer_heads and contributes to page reclaim latency.

  The patch changes the initial value of /proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the presently non-functional) dirty_sync_ratio so that they are reduced when the highmem:lowmem ratio exceeds 4:1.

  These ratios are scaled so that as the highmem:lowmem ratio goes beyond 4:1, the maximum amount of allowed dirty memory ceases to increase. It is clamped at the amount of memory which a 4:1 machine is allowed to use.

- Aggressive reduction in the dirty memory threshold at which background writeback cuts in. 2.4 uses 30% of ZONE_NORMAL. 2.5 uses 40% of total memory. This patch changes it to 10% of total memory (if total memory <= 4G. Even less otherwise - see above).

  This means that:

  - Much more writeback is performed by pdflush.
  - When the application is generating dirty data at a moderate rate, background writeback cuts in much earlier, so memory is cleaned more promptly.
  - Reduces the risk of user applications getting stalled by writeback.
  - Will damage dbench numbers. It turns out that the damage is fairly small, and dbench isn't a worthwhile workload for optimisation.

- Moderate reduction in the dirty level at which the write(2) caller is forced to perform writeback (throttling). Was 40% of total memory. Is now 30% of total memory (if total memory <= 4G, less otherwise).

  This is to reduce page reclaim latency, and generally because allowing processes to flood the machine with dirty data is a bad thing in mixed workloads.
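The effect of the 4:1 clamp on the resulting dirty-memory limit can be sketched as follows. This is an illustration of the stated behaviour only: the patch itself scales the ratios, whereas this hypothetical helper computes the equivalent page count directly, and all names are assumptions.

```c
#include <assert.h>

/* Hypothetical: pages that may be dirtied for a given percentage ratio.
 * Memory beyond a 4:1 highmem:lowmem split (that is, beyond 5x lowmem in
 * total) is not counted, so the limit stops growing past that point. */
static unsigned long long dirty_limit_pages(unsigned ratio_percent,
                                            unsigned long long lowmem_pages,
                                            unsigned long long highmem_pages)
{
    assert(ratio_percent <= 100);
    unsigned long long total = lowmem_pages + highmem_pages;
    unsigned long long cap = 5 * lowmem_pages;   /* total memory of a 4:1 machine */
    unsigned long long counted = total < cap ? total : cap;
    return counted * ratio_percent / 100;
}
```

A machine at exactly 4:1 and a machine at 15:1 with the same amount of lowmem get the same limit, which is the clamping described in the first bullet.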
-
Andrew Morton authored
Patch from Martin Bligh. "This mainly just rips out some magic extra structures in the boot time code to determine node sizes, and counts in pages instead of bytes. Oh, and I put the code that allocates pgdat into allocate_pgdat, instead of find_max_pfn_node, which seems like an incongruous home for it. No functionality changes, nothing touched outside i386 discontigmem ... just makes code cleaner and more readable. Tested on 16-way NUMA-Q."
-
Andrew Morton authored
Patch from Martin Bligh. "This mainly changes the PLAT_MY_MACRO_IS_ALL_CAPS() stuff to be normal_macro(), and takes out some unnecessary redirection of function names. No functionality changes, nothing touched outside i386 discontigmem ... just makes code readable. Rumour has it that the PLAT_* stuff came from IRIX - I don't see that as a good reason to make the Linux code unreadable. Tested on 16-way NUMA-Q."
-
Andrew Morton authored
Some adjustments to global dirty page accounting.

Previously, dirty page accounting counted all dirty pages. Even dirty anonymous pages. This has potential to upset the throttling logic in balance_dirty_pages(). Particularly as I suspect we should decrease the dirty memory writeback thresholds by a lot.

So this patch changes it so that we only account for dirty pagecache pages which have backing store. Not anonymous pages, not swapcache, not in-memory filesystem pages.

To support this, the `memory_backed' boolean has been added to struct backing_dev_info. When an address space's backing device is marked as memory-backed, the core kernel knows to not include that mapping's pages in the dirty memory accounting. For memory-backed mappings, dirtiness is a way of pinning the page, and there's nothing the kernel can do to clean the page to make it freeable.

driverfs, tmpfs, and ramfs have been converted to mark their mappings as memory-backed.

The ramdisk driver hasn't been converted. I have a separate patch for ramdisk, which fails to fix the longstanding problems in there :(

With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which is rather comforting.
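The role of the `memory_backed' flag can be shown with a small userspace model. struct backing_dev_info and its memory_backed member come from the description above; the cut-down struct address_space, the nr_dirty counter and the two accounting functions are hypothetical stand-ins, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the field described in the commit message is modelled. */
struct backing_dev_info {
    bool memory_backed;
};

/* Hypothetical, heavily reduced address_space. */
struct address_space {
    struct backing_dev_info *backing_dev_info;
};

/* Hypothetical global counter, standing in for /proc/meminfo:Dirty. */
static unsigned long nr_dirty;

/* Count a newly dirtied page only if its mapping has real backing store. */
static void account_page_dirtied(const struct address_space *mapping)
{
    assert(mapping && mapping->backing_dev_info);
    if (!mapping->backing_dev_info->memory_backed)
        nr_dirty++;
}

/* Mirror image of the above for a page that has been written back. */
static void account_page_cleaned(const struct address_space *mapping)
{
    assert(mapping && mapping->backing_dev_info);
    if (!mapping->backing_dev_info->memory_backed) {
        assert(nr_dirty > 0);
        nr_dirty--;
    }
}
```

In this model, dirtying a page of a memory-backed mapping (tmpfs, ramfs) leaves the counter untouched, so a sync that cleans every disk-backed page can bring it back to zero.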
-
Andrew Morton authored
Restore the gfp_mask in the VM's call to a_ops->releasepage(). We can block in there again, and XFS (at least) can use that.
-
Andrew Morton authored
The patch fixes a few problems in the writer throttling code. Mainly in the situation where a single large file is being written out. That file could be parked on sb->locked_inodes due to pdflush writeback, and the writer throttling path coming out of balance_dirty_pages() forgot to look for inodes on ->locked_inodes. The net effect was that the amount of dirty memory was exceeding the limit set in /proc/sys/vm/dirty_async_ratio, possibly to the point where the system gets seriously choked. The patch removes sb->locked_inodes altogether and teaches the throttling code to look for inodes on sb->s_io as well as sb->s_dirty. Also, just leave unwritten dirty pages on mapping->io_pages, and unwritten dirty inodes on sb->s_io. Putting them back onto ->dirty_pages and ->dirty_inodes was fairly pointless, given that both lists need to be looked at.
-
Ingo Molnar authored
This fixes the lockup. The bug happened because reparenting in the CLONE_THREAD case was done in a fundamentally non-atomic way, which was asking for various races to happen: e.g. the target parent gets reparented to the currently exiting thread ... (The non-CLONE_THREAD case is safe because nothing reparents init.)

The solution is to make all of reparenting atomic (including the forget_original_parent() bit) - this is possible with some reorganization done in signal.c and exit.c. This also made some of the loops simpler.
-
Alexander Viro authored
devfs side fixed thus:
-
Jens Axboe authored
Update hdreg to match 2.4 levels.

o Use consistent SRV_STAT instead of SERVICE_STAT
o Add sector count status bits for tcq
o Add various missing commands
o hd_driveid update
-
Jens Axboe authored
Update IDE pci ids to match 2.4.20-pre5-ac4 levels.
-
Jens Axboe authored
Add blk_fs_request(rq) to avoid testing rq->flags & REQ_CMD directly.
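A minimal sketch of what such a helper looks like and why it is preferable to open-coded flag tests. The reduced struct request, the bit values of REQ_CMD and REQ_SPECIAL, and the counting function are assumptions made for illustration; only the names blk_fs_request, rq->flags and REQ_CMD come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define REQ_CMD     (1UL << 0)   /* assumed bit position, for illustration */
#define REQ_SPECIAL (1UL << 1)   /* hypothetical non-filesystem request type */

/* Heavily reduced request structure: only the flags word is modelled. */
struct request {
    unsigned long flags;
};

/* The accessor hides which bit marks a filesystem request. */
#define blk_fs_request(rq) (((rq)->flags & REQ_CMD) != 0)

/* Hypothetical caller: counts filesystem requests without touching
 * rq->flags directly. */
static size_t count_fs_requests(const struct request *rqs, size_t n)
{
    size_t count = 0;

    assert(rqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (blk_fs_request(&rqs[i]))
            count++;
    return count;
}
```

Callers written this way keep working if the flag encoding changes later, since only the macro needs updating.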
-
Jens Axboe authored
This merges the changes from 2.4-ac that allow drivers to enable (and mark as used) only a subset of PCI resources, for those drivers that need it (at this point apparently only the i845 IDE controller).
-
Mikael Pettersson authored
In the 2.5.33->2.5.34 step someone removed "export-objs" from drivers/char/ftape/lowlevel/Makefile, which makes it impossible to build ftape as a module since it _does_ have a number of EXPORT_SYMBOL's. This reverts that change.
-
Mikael Pettersson authored
The 2.5 floppy driver has for a long time had two init/exit bugs:

1. It calls register_sys_device() on init, but fails to call unregister_sys_device() in exit. This leads to data structure corruption if floppy is a module and it gets unloaded.
2. It calls register_sys_device() early on init, but fails to call unregister_sys_device() if init fails. Again, this leads to data structure corruption.

The patch below fixes both these problems.
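Both bugs are instances of an unbalanced register/unregister pair. The userspace sketch below shows the usual balanced shape, with unwinding on the init failure path and a matching call in exit. Every name in it (register_dev, unregister_dev, probe_hardware, demo_init, demo_exit) is a hypothetical stand-in; it is not the floppy driver's code.

```c
#include <assert.h>

/* Stand-in for the system device list: counts live registrations. */
static int registered;

static int register_dev(void)
{
    registered++;
    return 0;
}

static void unregister_dev(void)
{
    assert(registered > 0);
    registered--;
}

/* Hypothetical later init step; `fail' lets a caller force an error. */
static int probe_hardware(int fail)
{
    return fail ? -1 : 0;
}

/* Init: anything registered early is unregistered again if a later
 * step fails (the second bug described above). */
static int demo_init(int fail_probe)
{
    int err = register_dev();

    if (err)
        return err;
    err = probe_hardware(fail_probe);
    if (err)
        goto out_unregister;
    return 0;

out_unregister:
    unregister_dev();
    return err;
}

/* Exit: undo what a successful init did (the first bug described above). */
static void demo_exit(void)
{
    unregister_dev();
}
```

After a failed init, or after init followed by exit, the registration count is back to zero, so nothing stale is left behind when the module goes away.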
-
- 09 Sep, 2002 24 commits
-
-
Stephen Rothwell authored
drivers/cdrom/cdrom.c is the only file (apart from include/linux/fcntl.h) that includes asm/fcntl.h. This changes that and should have no effect. I need to do this before I consolidate the asm/fcntl.h files into linux/fcntl.h (coming next - again).
-
Skip Ford authored
This is needed since 2.5.32 to successfully mount a UFS partition.
-
Rolf Fokkens authored
I've been playing with different HZ values in the 2.4 kernel for a while now, and apparently Linus also has decided to introduce a USER_HZ constant (I used CLOCKS_PER_SEC) while raising the HZ value on x86 to 1000.

On x86 timekeeping has proven to be relatively fragile when raising HZ (OK, I tried HZ=2048 which is quite high) because of the way the interrupt timer is configured to fire HZ times each second. This is done by configuring a divisor in the timer chip (LATCH) which divides a certain clock (1193180) and makes the chip fire interrupts at the resulting frequency.

Now comes the catch: NTP requires a clock accuracy of 500 ppm. For some HZ values the clock is not accurate enough to meet this requirement, hence NTP won't work well.

An example HZ value is 1020, which exceeds the 500 ppm requirement. In this case the best approximation is 1019.8 Hz. The xtime.tv_usec value is incremented by 980 each tick, which means that after one second the tv_usec value has increased by 999404 (should be 1000000), which is an accuracy of 596 ppm.

Some more examples:

HZ      Accuracy (ppm)
----    --------------
 100      17
1000     151
1024     632
2000     687
2008     343
2011      18
2048    1249

What I've been doing is replace tv_usec by tv_nsec, meaning xtime is now a timespec instead of a timeval. This allows the accuracy to be improved by a factor of 1000 for any (well ... any?) HZ value.

Of course all kinds of calculations had to be improved as well. The ACTHZ constant is introduced to approximate the actual HZ value; it's used to do some approximations of other related values.
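The figures in the table can be reproduced with a short calculation. The sketch below is a reconstruction under assumptions: it takes LATCH and the per-tick microsecond increment to be the nearest-integer roundings of 1193180/HZ and 1000000/HZ, which matches the table above, and the function name is hypothetical.

```c
#include <assert.h>

#define CLOCK_TICK_RATE 1193180u   /* timer chip input clock, from the text */

/* Hypothetical helper: clock error in ppm for a requested HZ value. */
static double clock_error_ppm(unsigned hz)
{
    assert(hz > 0);

    /* Divisor programmed into the timer chip, rounded to nearest. */
    unsigned latch = (CLOCK_TICK_RATE + hz / 2) / hz;
    /* Microseconds added to xtime per tick, rounded to nearest. */
    unsigned tick_usec = (1000000u + hz / 2) / hz;

    /* Interrupt frequency the chip really delivers. */
    double real_hz = (double)CLOCK_TICK_RATE / latch;
    /* Microseconds the kernel believes have passed after one real second. */
    double counted_usec = real_hz * tick_usec;

    /* Deviation from 1000000 usec is directly the error in ppm. */
    double err = counted_usec - 1000000.0;
    return err < 0 ? -err : err;
}
```

For HZ=1020 this gives roughly 584 ppm rather than 596, because it does not round the interrupt frequency to 1019.8 Hz before multiplying; either way the value is above the 500 ppm limit.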
-
Linus Torvalds authored
Cset exclude: greg@kroah.com|ChangeSet|20020905153320|19047
-
http://linux-acpi.bkbits.net/linux-acpi
Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
bk://linuxusb.bkbits.net/linus-2.5
Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
bk://linuxusb.bkbits.net/pci_hp-2.5
Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Andy Grover authored
-
Andy Grover authored
into groveronline.com:/root/bk/linux-acpi
-
Patrick Mochel authored
during the initcall sequence, after all CPUs have been brought up. mtrr_init() calls a static init_other_cpus(), which fires off a function on all other cpus to replicate the state across all of them.

arch/i386/kernel/smpboot.c::smp_callin() had the following:

    #ifdef CONFIG_MTRR
        /*
         * Must be done before calibration delay is computed
         */
        mtrr_init_secondary_cpu ();
    #endif

I couldn't figure this one out. The P4 manual says nothing about this, nor could I find any other documentation about it. The P4 manual says only that state must be synchronized across all CPUs, which it is. And, it happens before anything else is executed on the other CPUs, and before any devices or drivers have been brought up.

The cyrix mtrr code was also updated to handle this style of SMP initialization.
-
Linus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Linus Torvalds authored
-
Greg Kroah-Hartman authored
-
Greg Kroah-Hartman authored
-
Patrick Mochel authored
- The early startup code was changed so smp_prepare_cpus() is now called before do_basic_setup(). do_basic_setup() is where mtrr_init() is called, and mtrr_init_secondary_cpu() depends on mtrr_init() having been called first.
- mtrr_init_boot_cpu() was removed from the AP startup code. This was a SMP-only hack that made sure mtrr_init() happened when SMP was enabled. That's right - two different code paths to do the same thing, obscured by compile-time defines.

The appended patch makes sure mtrr_init() is called before smp_prepare_cpus(). It's ugly, and I'll work on a cleaner solution, but James: could you try it and see if it fixes your performance issues?
-
Juan Quintela authored
Documentation/porting: s/are/and/
Documentation/directory-locking: s/that means// was repeated
-
Petr Vandrovec authored
When recalc_sigpending was converted from an inline to a real function, the corresponding EXPORT_SYMBOL() was not added. It is needed at least for ncpfs and lockd.
-
Chris Wright authored
Update kernel-api.tmpl to reflect mtrr changes so that the docs will build.
-
Greg Kroah-Hartman authored
The pci_bus_* functions should be used instead.
-
Greg Kroah-Hartman authored
-
Irene Zubarev authored
-
Irene Zubarev authored
-
Irene Zubarev authored
- fix polling logic
- add ability to write [chassis/rxe]#slot# instead of just slot#
-
Andy Grover authored
into groveronline.com:/root/bk/linux-acpi
-