- 06 Feb, 2003 12 commits
-
-
Andrew Morton authored
Patch from rwhron@earthlink.net Micro-optimization of default_idle from -aa. current_cpu_data.hlt_works_ok is only false for some old 386/486 pcs.
-
Andrew Morton authored
ia32 and others can determine a page's hugeness by inspecting the pmd's value directly. No need to perform a VMA lookup against the user's virtual address. This patch ifdef's away the VMA-based implementation of hugepage-aware-follow_page for ia32 and replaces it with a pmd-based implementation. The intent is that architectures will implement one or the other. So the architecture either: 1: Implements hugepage_vma()/follow_huge_addr(), and stubs out pmd_huge()/follow_huge_pmd() or 2: Implements pmd_huge()/follow_huge_pmd(), and stubs out hugepage_vma()/follow_huge_addr()
-
Andrew Morton authored
Using a futex in a large page causes a kernel lockup in __pin_page() - because __pin_page's page revalidation uses follow_page(), and follow_page() doesn't work for hugepages. The patch fixes up follow_page() to return the appropriate 4k page for hugepages. This incurs a vma lookup for each follow_page(), which is considerable overhead in some situations. We only _need_ to do this if the architecture cannot determin a page's hugeness from the contents of the PMD. So this patch is a "reference" implementation for, say, PPC BAT-based hugepages.
-
Andrew Morton authored
Patch from "Kamble, Nitin A" <nitin.a.kamble@intel.com> Hello All, We were looking at the performance impact of the IRQ routing from the 2.5.52 Linux kernel. This email includes some of our findings about the way the interrupts are getting moved in the 2.5.52 kernel. Also there is discussion and a patch for a new implementation. Let me know what you think at nitin.a.kamble@intel.com Current implementation: ====================== We have found that the existing implementation works well on IA32 SMP systems with light load of interrupts. Also we noticed that it is not working that well under heavy interrupt load conditions on these SMP systems. The observations are: * Interrupt load of each IRQ is getting balanced on CPUs independent of load of other IRQs. Also the current implementation moves the IRQs randomly. This works well when the interrupt load is light. But we start seeing imbalance of interrupt load with existence of multiple heavy interrupt sources. Frequently multiple heavily loaded IRQs gets moved to a single CPU while other CPUs stay very lightly loaded. To achieve a good interrupts load balance, it is important to consider the load of all the interrupts together. This further can be explained with an example of 4 CPUs and 4 heavy interrupt sources. With the existing random movement approach, the chance of each of these heavy interrupt sources moving to separate CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/16. It means 13/16 = 81.25% of the time the situation is, some CPUs are very lightly loaded and some are loaded with multiple heavy interrupts. This causes the interrupt load imbalance and results in less performance. In a case of 2 CPUs and 2 heavily loaded interrupt sources, this imbalance happens 1/2 = 50% of the times. This issue becomes more and more severe with increasing number of heavy interrupt sources. * Another interesting observation is: We cannot see the imbalance of the interrupt load from /proc/interrupts. (/proc/interrupts shows the cumulative load of interrupts on all CPUs.) If the interrupt load is imbalanced and this imbalance is getting rotated among CPUs continuously, then /proc/interrupts will still show that the interrupt load is going to processors very evenly. Currently at the frequency (HZ/50) at which IRQs are moved across CPUs, it is not possible to see any interrupt load imbalance happening. * We have also found that, in certain cases the static IRQ binding performs better than the existing kernel distribution of interrupt load. The reason is, in a well-balanced interrupt load situations, these interrupts are unnecessarily getting frequently moved across CPUs. This adds an extra overhead; also it takes off the CPU cache warmth benefits. This came out from the performance measurements done on a 4-way HT (8 logical processors) Pentium 4 Xeon system running 8 copies of netperf. The 4 NICs in the system taking different IRQs generated sizable interrupt load with the help of connected clients. Here the netperf transactions/sec throughput numbers observed are: IRQs nicely manually bound to CPUs: 56.20K The current kernel implementation of IRQ movement: 50.05K ----------------------- The static binding of IRQs has performed 12.28% better than the current IRQ movement implemented in the kernel. * The current implementation does not distinguish siblings from the HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to balance the interrupt load with respect to processor packages first, and then among logical CPUs inside processor packages. For example if we have 2 heavy interrupt sources and 2 processor packages (4 logical CPUs); Assigning both the heavy interrupt sources in different processor packages is better, it will use different execution resources from the different processor packages. New revised implementation: ========================== We also have been working on a new implementation. The following points are in main focus. * At any moment heavily loaded IRQs are distributed to different CPUs to achieve as much balance as possible. * Lightly loaded interrupt sources are ignored from the load balancing, as they do not cause considerable imbalance. * When the heavy interrupt sources are balanced, they are not moved around. This also helps in keeping the CPU caches warm. * It has been made HT aware. While distributing the load, the load on a processor package to which the logical CPUs belong to is also considered. * In the situations of few (lesser than num_cpus) heavy interrupt sources, it is not possible to balance them evenly. In such case the existing code has been reused to move the interrupts. The randomness from the original code has been removed. * The time interval for redistribution has been made flexible. It varies as the system interrupt load changes. * A new kernel_thread is introduced to do the load balancing calculations for all the interrupt sources. It keeps the balanace_maps ready for interrupt handlers, keeping the overhead in the interrupt handling to minimum. * It allows the disabling of the IRQ distribution from the boot loader command line, if anybody wants to do it for any reason. * The algorithm also takes into account the static binding of interrupts to CPUs that user imposes from the /proc/irq/{n}/smp_affinity interface. Throughput numbers with the netperf setup for the new implementation: Current kernel IRQ balance implementation: 50.02K transactions/sec The new IRQ balance implementation: 56.01K transactions/sec --------------------- The performance improvement on P4 Xeon of 11.9% is observed. The new IRQ balance implementation also shows little performance improvement on P6 (Pentium II, III) systems. On a P6 system the netperf throughput numbers are: Current kernel IRQ balance implementation: 36.96K transactions/sec The new IRQ balance implementation: 37.65K transactions/sec --------------------- Here the performance improvement on P6 system of about 2% is observed. --------------------- Andrew Theurer <habanero@us.ibm.com> did some testing of this patch on a quad P4: I got a chance to run the NetBench benchmark with your patch on 2.5.54-mjb2 kernel. NetBench measures SMB/CIFS performance by using several SMB clients (in this case 44 Windows 2000 systems), sending SMB requests to a Linux server running Samba 2.2.3a+sendfile. Result is in throughput, Mbps. Generally the network traffic on the server is 60% recv, 40% tx. I believe we have very similar systems. Mine is a 4 x 1.6 GHz, 1 MB L3 P4 Xeon with 4 GB DDR memory (3.2 GB/sec I believe). The chipset is "Summit". I also have more than one Intel e1000 adapters. I decided to run a few configurations, first with just one adapter, with and without HT support in the kernel (acpi=off), then add another adapter and test again with/without HT. Here are the results: 4P, no HT, 1 x e1000, no kirq: 1214 Mbps, 4% idle 4P, no HT, 1 x e1000, kirq: 1223 Mbps, 4% idle, +0.74% I suppose we didn't see much of an improvement here because we never run into the situation where more than one interrupt with a high rate is routed to a single CPU on irq_balance. 4P, HT, 1 x e1000, no kirq: 1214 Mbps, 25% idle 4P, HT, 1 x e1000, kirq: 1220 Mbps, 30% idle, +0.49% Again, not much of a difference just yet, but lots of idle time. We may have reached the limit at which one logical CPU can process interrupts for an e1000 adapter. There are other things I can probably do to help this, like int delay, and NAPI, which I will get to eventually. 4P, HT, 2 x e1000, no kirq: 1269 Mbps, 23% idle 4P, HT, 2 x e1000, kirq: 1329 Mbps, 18% idle +4.7% OK, almost 5% better! Probably has to do with a couple of things; the fact that your code does not route two different interrupts to the same core/different logical cpus (quite obvious by looking at /proc/interrupts), and that more than one interrupt does not go to the same cpu if possible. I suspect irq_balance did some of those [bad] things some of the time, and we observed a bottleneck in int processing that was lower than with kirq. I don't think all of the idle time is because of a int processing bottleneck. I'm just not sure what it is yet :) Hopefully something will become obvious to me... Overall I like the way it works, and I believe it can be tweaked to work with NUMA when necessary. I hope to have access to a specweb system on a NUMA box soon, so we can verify that.
-
Andrew Morton authored
Patch from Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> there's a SMP race condition between __sync_single_inode (or __sync_one on 2.4.20) and __mark_inode_dirty. __mark_inode_dirty doesn't take inode spinlock. As we know -- unless you take a spinlock or use barrier, processor can change order of instructions. CPU 1 modify inode (but modifications are in cpu-local buffer and do not go to bus) calls __mark_inode_dirty it sees I_DIRTY and exits immediatelly CPU 2 takes spinlock calls __sync_single_inode inode->i_state &= ~I_DIRTY writes the inode (but does not see modifications by CPU 1 yet) CPU 1 flushes its write buffer to the bus inode is already written, clean, modifications done by CPU1 are lost The easiest fix would be to move the test inside spinlock in __mark_inode_dirty; if you do not want to suffer from performance loss, use the attached patches that use memory barriers to ensure ordering of reads and writes.
-
Andrew Morton authored
Patch from "Stephen D. Smalley" <sds@epoch.ncsc.mil> This patch restores the LSM hook calls in sendfile to 2.5.59. The hook was previously added as of 2.5.29 but the hook calls in sendfile were subsequently lost as a result of the sendfile rewrite as of 2.5.30.
-
Andrew Morton authored
Patch from Roger Gammans <roger@computer-surgery.co.uk> Adds lots of API documentation to the JBD layer.
-
Andrew Morton authored
Patch from Petr Baudis <pasky@ucw.cz> this patch (against 2.5.59) updates Documentation/kernel-parameters.txt to the (more-or-less; I certainly missed some parameters) current state of kernel. Note also that I will probably send up another update after few further kernel releases..
-
Andrew Morton authored
We don't need these with self-unplugging queues. The patch also contains a couple of microopts suggested by Andrea: we don't need to run sync_page() if the page just came unlocked.
-
Andrew Morton authored
The patch teaches a queue to unplug itself: a) if is has four requests OR b) if it has had plugged requests for 3 milliseconds. These numbers may need to be tuned, although doing so doesn't seem to make much difference. 10 msecs works OK, so HZ=100 machines will be fine. Instrumentation shows that about 5-10% of requests were started due to the three millisecond timeout (during a kernel compile). That's somewhat significant. It means that the kernel is leaving stuff in the queue, plugged, for too long. This testing was with a uniprocessor preemptible kernel, which is particularly vulnerable to unplug latency (submit some IO, get preempted before the unplug). This patch permits the removal of a lot of rather lame unplugging in page reclaim and in the writeback code, which kicks the queues (globally!) every four megabytes to get writeback underway. This patch doesn't use blk_run_queues(). It is able to kick just the particular queue. The patch is not expected to make much difference really, except for AIO. AIO needs a blk_run_queues() in its io_submit() call. For each request. This means that AIO has to disable plugging altogether, unless something like this patch does it for it. It means that AIO will unplug *all* queues in the machine for every io_submit(). Even against a socket! This patch was tested by disabling blk_run_queues() completely. The system ran OK. The 3 milliseconds may be too long. It's OK for the heavy writeback code, but AIO may want less. Or maybe AIO really wants zero (ie: disable plugging). If that is so, we need new code paths by which AIO can communicate the "immediate unplug" information - a global unplug is not good. To minimise unplug latency due to user CPU load, this patch gives keventd `nice -10'. This is of course completely arbitrary. Really, I think keventd should be SCHED_RR/MAX_RT_PRIO-1, as it has been in -aa kernels for ages.
-
Andrew Morton authored
Patch from Chris Mason <mason@suse.com> The patch below is against 2.5.59, various forms have been floating around for a while, and Andrew recently included this fixed version in 2.5.55-mm. The end result is faster reads and writes for reiserfs. This adds reiserfs support for readpages, along with a support func in fs/mpage.c to deal with the reiserfs_get_block call sending back up to date buffers with packed tails copied into them. Most of the changes are to reiserfs_writepage, which still had many 2.4isms in the way it started io, dealt with errors and handled the bh state bits. I've also added an optimization so it only starts transactions when we need to copy a packed tail into the btree or fill a hole, instead of any time reiserfs_writepage hits an unmapped buffer.
-
Andrew Morton authored
Patch from Gerd Knorr <kraxel@bytesex.org> bttv requires CONFIG_SOUND.
-
- 05 Feb, 2003 4 commits
-
-
http://jdike.stearns.org:5000/fixes-2.5Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
http://jdike.stearns.org:5000/stack-2.5Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
http://jdike.stearns.org:5000/skas-2.5Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
http://jdike.stearns.org:5000/updates-2.5Linus Torvalds authored
into home.transmeta.com:/home/torvalds/v2.5/linux
-
- 07 Feb, 2003 3 commits
-
-
Greg Kroah-Hartman authored
-
Greg Kroah-Hartman authored
into kroah.com:/home/greg/linux/BK/pci-hp-2.5
-
Greg Kroah-Hartman authored
-
- 06 Feb, 2003 14 commits
-
-
Greg Kroah-Hartman authored
-
Greg Kroah-Hartman authored
-
Randy Dunlap authored
Here's the memory leaks patch for acpiphp_glue.c.
-
Greg Kroah-Hartman authored
-
Daniel Stekloff authored
The patch is modeled after your method for creating files for usb. It makes a single file for pci sysfs files (except for pool, which I haven't touched yet). It also exposes more pci information to User Space through sysfs. Finally, it removes the dependence on the proc pci code for sysfs files.
-
Randy Dunlap authored
Fixes problems found by the CHECKER program in the pci hotplug drivers
-
Greg Kroah-Hartman authored
Moved functions from drivers/hotplug/pci_hotplug_util.c to drivers/pci/hotplug.c, which is a better place for them.
-
Greg Kroah-Hartman authored
Relies on sysfs_update_file() to be present in the kernel.
-
Greg Kroah-Hartman authored
-
Greg Kroah-Hartman authored
-
Stanley Wang authored
Here is a little patch that remove procfs stuff in pci_hotplug_core.c Remove /proc entry for pci_hotplug_core.
-
Stanley Wang authored
-
Greg Kroah-Hartman authored
-
Greg Kroah-Hartman authored
These were pointed out by "dan carpenter" <error27@email.com> from his smatch tool.
-
- 05 Feb, 2003 7 commits
-
-
Andrew Morton authored
-
http://mdomsch.bkbits.net/linux-2.5-eddLinus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Russell King authored
Ok, here's my proposed fix, which appears to work with preempt. I haven't tested on non-preempt, nor (obviously since its from me) SMP. However, I forsee no problems caused by this change. release_dev() sets filp->private_data to NULL when the tty layer has done with the file descriptor. However, it remains on the tty_files list until __fput completes.
-
http://linux-acpi.bkbits.net/linux-acpiLinus Torvalds authored
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
-
Stephen Hemminger authored
Add "seqlock" infrastructure for doing low-overhead optimistic reader locks (writer increments a sequence number, reader verifies that no writers came in during the critical region, and lots of careful memory barriers to take care of business). Make xtime/get_jiffies_64() use this new locking.
-
Chris Wright authored
Trivial patch from Randy Dunlap <rddunlap@osdl.org> removes some useless error/retval assignments.
-
Stephen Hemminger authored
Found by inspection of of the x86_64 gettimeofday. The problem is that the code always records the maximum value but it is not reset on the next clock tick. As written, I see it keeping the maximum number of microseconds since the last clock tick.
-