Commits · 9dc8af8046afd8c05588b0ec338e6f3358ce40c9 · nexedi / linux

10 Sep, 2002 16 commits

[PATCH] rmap pte_chain speedup and space saving · 9dc8af80

Andrew Morton authored Sep 09, 2002

The pte_chains presently consist of a pte pointer and a `next' link.
So there's a 50% memory wastage here as well as potential for a lot of
misses during walks of the singly-linked per-page list.

This patch increases the pte_chain structure to occupy a full
cacheline.  There are 7, 15 or 31 pte pointers per structure rather
than just one.  So the wastage falls to a few percent and the number of
misses during the walk is reduced.
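
As a rough illustration of the layout described above, the following C sketch packs one `next` link plus as many pte pointers as fit into a single cacheline. The names `CACHE_LINE_BYTES`, `NRPTE`, `nrpte_for()` and `pte_chain_count()`, the 64-byte line size, and the use of `void *` as a stand-in for a pte pointer are assumptions made for this example, not details taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cacheline size for this sketch; real hardware varies. */
#define CACHE_LINE_BYTES 64

/* Slots left after reserving one pointer-sized slot for `next`. */
#define NRPTE (CACHE_LINE_BYTES / sizeof(void *) - 1)

struct pte_chain {
	struct pte_chain *next;	/* link to the next cacheline-sized chunk */
	void *ptes[NRPTE];	/* stand-in for pte pointers; NULL = unused */
};

/* The structure fills exactly one (assumed) cacheline. */
_Static_assert(sizeof(struct pte_chain) == CACHE_LINE_BYTES,
	       "pte_chain should occupy one cacheline");

/* Number of pte pointers per chunk for a given line and pointer size. */
static size_t nrpte_for(size_t cacheline_bytes, size_t pointer_bytes)
{
	assert(pointer_bytes > 0 && cacheline_bytes >= 2 * pointer_bytes);
	return cacheline_bytes / pointer_bytes - 1;
}

/* Walk the chain and count the pte pointers in use. */
static size_t pte_chain_count(const struct pte_chain *pc)
{
	size_t n = 0;

	for (; pc != NULL; pc = pc->next)
		for (size_t i = 0; i < NRPTE; i++)
			if (pc->ptes[i] != NULL)
				n++;
	return n;
}
```

With 4-byte pointers, `nrpte_for()` gives 7, 15 and 31 for 32-, 64- and 128-byte cachelines, which matches the counts quoted in the message.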

The patch doesn't make much difference in simple testing, because in
those tests the pte_chain list from the previous page has good cache
locality with the next page's list.

The patch sped up Anton's "10,000 concurrently exiting shells" test by
3x or 4x.  It gives a 10% reduction in system time for a kernel build
on 16p NUMAQ.

It saves memory and reduces the amount of work performed in the slab
allocator.

Pages which are mapped by only a single process continue to not have a
pte_chain.  The pointer in struct page points directly at the mapping
pte (a "PageDirect" pte pointer).  Once the page is shared a pte_chain
is allocated and both the new and old pte pointers are moved into it.

We used to collapse the pte_chain back to a PageDirect representation
in page_remove_rmap().  That has been changed.  That collapse is now
performed inside page reclaim, via page_referenced().  The thinking
here is that if a page was previously shared then it may become shared
again, so leave the pte_chain structure in place.  But if the system is
under memory pressure then start reaping them anyway.


[PATCH] buffer_head takedown for bighighmem machines · e182d612

Andrew Morton authored Sep 09, 2002

This patch addresses the excessive consumption of ZONE_NORMAL by
buffer_heads on highmem machines. The algorithms which decide which
buffers to shoot down are fairly dumb, but they only cut in on machines
with large highmem:lowmem ratios and the code footprint is tiny.

The buffer.c change implements the buffer_head accounting - it sets the
upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.
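
The 10% rule can be sketched as a simple predicate. This is only an illustration of the arithmetic; the function name `bh_over_limit_sketch()` and its parameters are hypothetical and do not come from the buffer.c change.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns nonzero when the memory held by buffer_heads exceeds
 * 10% of ZONE_NORMAL. All sizes are in bytes.
 */
static int bh_over_limit_sketch(unsigned long long nr_buffer_heads,
				size_t buffer_head_bytes,
				unsigned long long zone_normal_bytes)
{
	unsigned long long limit_bytes = zone_normal_bytes / 10;

	assert(buffer_head_bytes > 0);
	return nr_buffer_heads * buffer_head_bytes > limit_bytes;
}
```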

A possible side-effect of this change is that the kernel will perform
more calls to get_block() to map pages to disk. This will only be
observed when a file is being repeatedly overwritten - this is the only
case in which the "cached get_block result" in the buffers is useful.

I did quite some testing of this back in the delalloc ext2 days, and
was not able to come up with a test in which the cached get_block
result was measurably useful. That's for ext2, which has a fast
get_block().

A desirable side effect of this patch is that the kernel will be able
to cache much more blockdev pagecache in ZONE_NORMAL, so there are more
ext2/3 indirect blocks in cache, so with some workloads, less I/O will
be performed.

In mpage_writepage(): if the number of buffer_heads is excessive then
buffers are stripped from pages as they are submitted for writeback.
This change is only useful for filesystems which are using the mpage
code. That's ext2 and ext3-writeback and JFS. An mpage patch for
reiserfs was floating about but seems to have got lost.

There is no need to strip buffers for reads because the mpage code does
not attach buffers for reads.

These are perhaps not the most appropriate buffer_heads to toss away.
Perhaps something smarter should be done to detect file overwriting, or
to toss the 'oldest' buffer_heads first.

In refill_inactive(): if the number of buffer_heads is excessive then
strip buffers from pages as they move onto the inactive list. This
change is useful for all filesystems. This approach is good because
pages which are being repeatedly overwritten will remain on the active
list and will retain their buffers, whereas pages which are not being
overwritten will be stripped.


[PATCH] reduce the default dirty memory thresholds · ce92adf3

Andrew Morton authored Sep 09, 2002

Writeback parameter tuning.  Somewhat experimental, but heading in the
right direction, I hope.

- Allowing 40% of physical memory to be dirtied on massive ia32 boxes
  is unreasonable.  It pins too many buffer_heads and contributes to
  page reclaim latency.

  The patch changes the initial value of
  /proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the
  presently non-functional) dirty_sync_ratio so that they are reduced
  when the highmem:lowmem ratio exceeds 4:1.

  These ratios are scaled so that as the highmem:lowmem ratio goes
  beyond 4:1, the maximum amount of allowed dirty memory ceases to
  increase.  It is clamped at the amount of memory which a 4:1 machine
  is allowed to use.

- Aggressive reduction in the dirty memory threshold at which
  background writeback cuts in.  2.4 uses 30% of ZONE_NORMAL.  2.5 uses
  40% of total memory.  This patch changes it to 10% of total memory
  (if total memory <= 4G.  Even less otherwise - see above).

This means that:

- Much more writeback is performed by pdflush.

- When the application is generating dirty data at a moderate
  rate, background writeback cuts in much earlier, so memory is
  cleaned more promptly.

- Reduces the risk of user applications getting stalled by writeback.

- Will damage dbench numbers.  It turns out that the damage is
  fairly small, and dbench isn't a worthwhile workload for
  optimisation.

- Moderate reduction in the dirty level at which the write(2) caller
  is forced to perform writeback (throttling).  Was 40% of total
  memory.  Is now 30% of total memory (if total memory <= 4G, less
  otherwise).

This is to reduce page reclaim latency, and generally because
allowing processes to flood the machine with dirty data is a bad
thing in mixed workloads.
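
The clamping described above can be sketched as follows. This is a minimal illustration that takes the 4:1 highmem:lowmem wording literally and works in pages rather than ratios; `dirty_limit_pages()` and its parameters are hypothetical names, and the patch itself may compute the scaling differently.

```c
#include <assert.h>

/*
 * Dirty-memory limit in pages for a base percentage of total memory,
 * clamped to what a machine with a 4:1 highmem:lowmem ratio would get.
 */
static unsigned long long dirty_limit_pages(unsigned int base_pct,
					    unsigned long long lowmem_pages,
					    unsigned long long highmem_pages)
{
	unsigned long long total = lowmem_pages + highmem_pages;
	/* A 4:1 machine has highmem = 4 * lowmem, so total = 5 * lowmem. */
	unsigned long long cap = 5 * lowmem_pages;

	assert(base_pct <= 100);
	if (total > cap)
		total = cap;	/* beyond 4:1 the limit stops growing */
	return total * base_pct / 100;
}
```

With a 10% base, a machine with 200000 lowmem pages gets the same limit whether it has 800000 or 3000000 highmem pages.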


[PATCH] discontigmem code cleanup #2 · e2f5e334

Andrew Morton authored Sep 09, 2002

Patch from Martin Bligh

"This mainly just rips out some magic extra structures in the boot time
 code to determine node sizes, and counts in pages instead of bytes.
 Oh, and I put the code that allocates pgdat into allocate_pgdat,
 instead of find_max_pfn_node, which seems like an incongruous home for
 it.

 No functionality changes, nothing touched outside i386 discontigmem ...
 just makes code cleaner and more readable.  Tested on 16-way NUMA-Q."


[PATCH] discontigmem code cleanup #1 · 79a96230

Andrew Morton authored Sep 09, 2002

Patch from Martin Bligh.

"This mainly changes the PLAT_MY_MACRO_IS_ALL_CAPS() stuff to be
 normal_macro(), and takes out some unnecessary redirection of function
 names.  No functionality changes, nothing touched outside i386
 discontigmem ...  just makes code readable.  Rumour has it that the
 PLAT_* stuff came from IRIX - I don't see that as a good reason to make
 the Linux code unreadable.  Tested on 16-way NUMA-Q."


[PATCH] exact dirty state accounting · 1f90eedd

Andrew Morton authored Sep 09, 2002

Some adjustments to global dirty page accounting.

Previously, dirty page accounting counted all dirty pages.  Even dirty
anonymous pages.  This has potential to upset the throttling logic in
balance_dirty_pages().  Particularly as I suspect we should decrease
the dirty memory writeback thresholds by a lot.

So this patch changes it so that we only account for dirty pagecache
pages which have backing store.  Not anonymous pages, not swapcache,
not in-memory filesystem pages.

To support this, the `memory_backed' boolean has been added to struct
backing_dev_info.  When an address space's backing device is marked as
memory-backed, the core kernel knows to not include that mapping's
pages in the dirty memory accounting.
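
A minimal sketch of how such a flag might gate the accounting is shown here. The struct layouts are reduced to the one field that matters, and `nr_dirty` and `account_page_dirtied()` are hypothetical names for this example rather than the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures named in the text. */
struct backing_dev_info {
	bool memory_backed;	/* true for in-memory filesystems */
};

struct address_space {
	struct backing_dev_info *backing_dev_info;
};

static unsigned long nr_dirty;	/* hypothetical global dirty-page counter */

/* Count a newly dirtied page only if it can be cleaned by writeback. */
static void account_page_dirtied(const struct address_space *mapping)
{
	assert(mapping != NULL && mapping->backing_dev_info != NULL);
	if (!mapping->backing_dev_info->memory_backed)
		nr_dirty++;
}
```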

For memory-backed mappings, dirtiness is a way of pinning the page, and
there's nothing the kernel can do to clean the page to make it freeable.

driverfs, tmpfs, and ramfs have been converted to mark their mappings as
memory-backed.

The ramdisk driver hasn't been converted.  I have a separate patch for
ramdisk, which fails to fix the longstanding problems in there :(

With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which
is rather comforting.


[PATCH] pass the correct flags to aops->releasepage() · 6a0fb424

Andrew Morton authored Sep 09, 2002

Restore the gfp_mask in the VM's call to a_ops->releasepage().  We can
block in there again, and XFS (at least) can use that.


[PATCH] writer throttling fix · 95b88300

Andrew Morton authored Sep 09, 2002

The patch fixes a few problems in the writer throttling code.  Mainly
in the situation where a single large file is being written out.

That file could be parked on sb->locked_inodes due to pdflush
writeback, and the writer throttling path coming out of
balance_dirty_pages() forgot to look for inodes on ->locked_inodes.

The net effect was that the amount of dirty memory was exceeding the
limit set in /proc/sys/vm/dirty_async_ratio, possibly to the point
where the system gets seriously choked.

The patch removes sb->locked_inodes altogether and teaches the
throttling code to look for inodes on sb->s_io as well as sb->s_dirty.

Also, just leave unwritten dirty pages on mapping->io_pages, and
unwritten dirty inodes on sb->s_io.  Putting them back onto
->dirty_pages and ->dirty_inodes was fairly pointless, given that both
lists need to be looked at.


[PATCH] Re: do_syslog/__down_trylock lockup in current BK · 0d8b3b44

Ingo Molnar authored Sep 09, 2002

This fixes the lockup.

The bug happened because reparenting in the CLONE_THREAD case was done in
a fundamentally non-atomic way, which was asking for various races to
happen: eg. the target parent gets reparented to the currently exiting
thread ...

(the non-CLONE_THREAD case is safe because nothing reparents init.)

the solution is to make all of reparenting atomic (including the
forget_original_parent() bit) - this is possible with some reorganization
done in signal.c and exit.c. This also made some of the loops simpler.


[PATCH] Missing IDE partition 3 of 3 on 2.5.34 · 8fb345bd
Alexander Viro authored Sep 09, 2002
```
devfs side fixed thus:
```

[PATCH] hdreg command updates etc · f1c84a2e

Jens Axboe authored Sep 09, 2002

Update hdreg to match 2.4 levels.

o Use consistent SRV_STAT instead of SERVICE_STAT
o Add sector count status bits for tcq
o Add various missing commands
o hd_driveid update


[PATCH] IDE pci ids · 8930eafc
Jens Axboe authored Sep 09, 2002
```
Update IDE pci ids to match 2.4.20-pre5-ac4 levels.
```
[PATCH] blk_fs_request() · 4372b607
Jens Axboe authored Sep 09, 2002
```
Add blk_fs_request(rq) to avoid testing rq->flags & REQ_CMD directly.
```
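
A helper of this kind might look like the sketch below. The reduced `struct request`, the flag bit values, and the wrapper `is_fs_request()` are assumptions for illustration; only the names `blk_fs_request`, `rq->flags` and `REQ_CMD` come from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit values, chosen only for this example. */
#define REQ_CMD	(1UL << 0)	/* regular filesystem request */
#define REQ_PC	(1UL << 1)	/* some other request type */

/* Reduced stand-in for the block layer's request structure. */
struct request {
	unsigned long flags;
};

/* Hide the flag test behind a named helper. */
#define blk_fs_request(rq)	((rq)->flags & REQ_CMD)

/* Callers test the helper instead of poking at rq->flags directly. */
static int is_fs_request(const struct request *rq)
{
	assert(rq != NULL);
	return blk_fs_request(rq) != 0;
}
```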

[PATCH] PCI individual resource handling · e47901f9

Jens Axboe authored Sep 09, 2002

This merges the changes from 2.4-ac that allow drivers to enable (and
mark as used) only a subset of PCI resources, for those drivers that
need it (at this point apparently only the i845 IDE controller).


[PATCH] undo 2.5.34 ftape damage · ac9c060c

Mikael Pettersson authored Sep 09, 2002

In the 2.5.33->2.5.34 step someone removed "export-objs" from
drivers/char/ftape/lowlevel/Makefile, which makes it impossible to build
ftape as a module since it _does_ have a number of EXPORT_SYMBOL's.

This reverts that change.


[PATCH] 2.5.34 floppy driver init/exit fixes · 9d1f9419

Mikael Pettersson authored Sep 09, 2002

The 2.5 floppy driver has for a long time had two init/exit bugs:
1. It calls register_sys_device() on init, but fails to call
   unregister_sys_device() in exit. This leads to data structure
   corruption if floppy is a module and it gets unloaded.
2. It calls register_sys_device() early on init, but fails to call
   unregister_sys_device() if init fails. Again, this leads to
   data structure corruption.

The patch below fixes both these problems.
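
The shape of such a fix can be sketched with the usual unwind-on-error pattern. Everything in this example is a stand-in: the `*_stub()` functions, `floppy_init_sketch()` and `floppy_exit_sketch()` are hypothetical and only model the registration state, not the real driver or the real register_sys_device() interface.

```c
#include <assert.h>

static int sys_device_registered;	/* stand-in for the registration state */

static int register_sys_device_stub(void)
{
	sys_device_registered++;
	return 0;
}

static void unregister_sys_device_stub(void)
{
	assert(sys_device_registered > 0);
	sys_device_registered--;
}

/* Hypothetical later init step that can fail. */
static int setup_hardware_stub(int should_fail)
{
	return should_fail ? -1 : 0;
}

static int floppy_init_sketch(int should_fail)
{
	int err = register_sys_device_stub();

	if (err)
		return err;
	err = setup_hardware_stub(should_fail);
	if (err)
		goto out_unregister;	/* bug 2: undo registration on failure */
	return 0;

out_unregister:
	unregister_sys_device_stub();
	return err;
}

static void floppy_exit_sketch(void)
{
	unregister_sys_device_stub();	/* bug 1: exit must unregister too */
}
```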


09 Sep, 2002 24 commits

[PATCH] cdrom.c is the only file to include asm/fcntl.h · ed245b59

Stephen Rothwell authored Sep 09, 2002

drivers/cdrom/cdrom.c is the only file (apart from include/linux/fcntl.h)
that includes asm/fcntl.h. This changes that and should have no effect.

I need to do this before I consolidate the asm/fcntl.h files into
linux/fcntl.h (coming next - again).


[PATCH] 2.5.34 ufs/super.c · 2ecc1c29
Skip Ford authored Sep 09, 2002
```
This is needed since 2.5.32 to successfully mount a UFS partition.
```

[PATCH] USER_HZ & NTP problems · 3843e047

Rolf Fokkens authored Sep 09, 2002

I've been playing with different HZ values in the 2.4 kernel for a while
now, and apparently Linus also has decided to introduce a USER_HZ
constant (I used CLOCKS_PER_SEC) while raising the HZ value on x86 to
1000.

On x86 timekeeping has proven to be relatively fragile when raising HZ (OK,
I tried HZ=2048 which is quite high) because of the way the interrupt
timer is configured to fire HZ times each second.  This is done by
configuring a divisor in the timer chip (LATCH) which divides a certain
clock (1193180) and makes the chip fire interrupts at the resulting
frequency.

Now comes the catch: NTP requires a clock accuracy of 500 ppm.  For some
HZ values the clock is not accurate enough to meet this requirement,
hence NTP won't work well.

An example HZ value is 1020 which exceeds the 500 ppm requirement.  In
this case the best approximation is 1019.8 Hz.  The xtime.tv_usec value
is increased by 980 each tick, which means that after one second the
tv_usec value has increased by 999404 (should be 1000000), which is an
error of 596 ppm.

Some more examples:
	  HZ Accuracy (ppm)
	---- --------------
	 100             17
	1000            151
	1024            632
	2000            687
	2008            343
	2011             18
	2048           1249
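
The figures in the table can be reproduced with the sketch below. It assumes that both the timer divisor and the per-tick microsecond increment are rounded to the nearest integer; those rounding rules, and the function names `latch_for()`, `tick_usec_for()` and `error_ppm()`, are assumptions of this example rather than statements from the text.

```c
#include <assert.h>

#define CLOCK_TICK_RATE 1193180LL	/* timer chip input clock, from the text */

/* Divisor programmed into the timer chip, rounded to nearest. */
static long long latch_for(long long hz)
{
	assert(hz > 0);
	return (CLOCK_TICK_RATE + hz / 2) / hz;
}

/* Microseconds added to tv_usec per tick, rounded to nearest. */
static long long tick_usec_for(long long hz)
{
	assert(hz > 0);
	return (1000000 + hz / 2) / hz;
}

/* Clock error in ppm, rounded to nearest. */
static long long error_ppm(long long hz)
{
	/* usec credited per real second = tick_usec * actual interrupt rate */
	double usec_per_sec = (double)tick_usec_for(hz) *
			      (double)CLOCK_TICK_RATE / (double)latch_for(hz);
	double err = usec_per_sec - 1e6;

	if (err < 0)
		err = -err;
	/* An error in usec per 1000000 usec is already in ppm. */
	return (long long)(err + 0.5);
}
```

For HZ=1020 this sketch yields 584 ppm rather than 596, because the text rounds the interrupt rate to 1019.8 Hz before multiplying; either way the result exceeds the 500 ppm requirement.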

What I've been doing is replace tv_usec by tv_nsec, meaning xtime is now
a timespec instead of a timeval.  This allows the accuracy to be
improved by a factor of 1000 for any (well ...  any?) HZ value. 

Of course all kinds of calculations had to be improved as well.  The
ACTHZ constant is introduced to approximate the actual HZ value; it's
used to do some approximations of other related values.
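
To see why nanosecond resolution helps, the sketch below derives the per-tick increment from the actual interrupt period (the timer divisor over the input clock) instead of from the nominal HZ, and reports the residual error in parts per billion. That derivation, and the name `error_ppb_nsec()`, are assumptions of this example; the patch's own ACTHZ arithmetic may differ.

```c
#include <assert.h>

#define CLOCK_TICK_RATE 1193180LL	/* timer chip input clock, from the text */

/*
 * Residual clock error in parts per billion when each tick adds a
 * whole number of nanoseconds based on the actual tick length.
 */
static long long error_ppb_nsec(long long hz)
{
	long long latch, tick_nsec, nsec_per_sec, err;

	assert(hz > 0);
	latch = (CLOCK_TICK_RATE + hz / 2) / hz;
	/* True tick length is latch / CLOCK_TICK_RATE seconds; round to ns. */
	tick_nsec = (latch * 1000000000LL + CLOCK_TICK_RATE / 2) /
		    CLOCK_TICK_RATE;
	/* Nanoseconds credited per real second (truncated). */
	nsec_per_sec = tick_nsec * CLOCK_TICK_RATE / latch;
	err = nsec_per_sec - 1000000000LL;
	return err < 0 ? -err : err;
}
```

Under these assumptions the residual error stays below 2 ppm for the HZ values discussed above, compared with hundreds of ppm when the increment is a whole number of microseconds.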


Never _ever_ BUG() if you don't have to · ba815d85
Linus Torvalds authored Sep 09, 2002
```
Cset exclude: greg@kroah.com|ChangeSet|20020905153320|19047
```
Merge http://linux-acpi.bkbits.net/linux-acpi · 38908d74
Linus Torvalds authored Sep 09, 2002
```
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
```
Merge bk://linuxusb.bkbits.net/linus-2.5 · 8a0f08e2
Linus Torvalds authored Sep 09, 2002
```
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
```
Merge bk://linuxusb.bkbits.net/pci_hp-2.5 · 159b0104
Linus Torvalds authored Sep 09, 2002
```
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
```
ACPI: Fix possible sleeping at interrupt context (Matthew Wilcox) · 03e691ed
Andy Grover authored Sep 09, 2002

Merge groveronline.com:/root/bk/linux-2.5 · a340bf30
Andy Grover authored Sep 09, 2002
```
into groveronline.com:/root/bk/linux-acpi
```

Reorganize the mtrr init sequence a bit. All mtrr init now happens · b6a3d01f

Patrick Mochel authored Sep 09, 2002

during the initcall sequence, after all CPUs have been brought up. 
mtrr_init() calls a static init_other_cpus(), which fires off a function 
on all other cpus to replicate the state across all of them. 

arch/i386/kernel/smpboot.c::smp_callin() had the following: 

#ifdef CONFIG_MTRR
       /*
        * Must be done before calibration delay is computed
        */
       mtrr_init_secondary_cpu ();
#endif


I couldn't figure this one out. The P4 manual says nothing about this, nor
could I find any other documentation about it. The P4 manual says only that state
must be synchronized across all CPUs, which it is. And, it happens before
anything else is executed on the other CPUs, and before any devices or
drivers have been brought up.

The cyrix mtrr code was also updated to handle this style of SMP initialization.


Merge home:v2.5/linux · 2b5d7502

Linus Torvalds authored Sep 09, 2002

into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux


Get Intel model name from the CPU · ad6b7f70
Linus Torvalds authored Sep 09, 2002

IBM PCI Hotplug driver: changed calls to pci_*_nodev() to pci_bus_*() · 441a1393
Greg Kroah-Hartman authored Sep 09, 2002

Compaq PCI Hotplug driver: changed calls to pci_*_nodev() to pci_bus_*() · 82475a64
Greg Kroah-Hartman authored Sep 09, 2002


[PATCH] Re: Performance issue in 2.5.32+ · ac7349b6

Patrick Mochel authored Sep 09, 2002

- The early startup code was changed so smp_prepare_cpus() is now called
  before do_basic_setup().  do_basic_setup() is where mtrr_init() is
  called, which mtrr_init_secondary_cpu() depends on having been called.

- mtrr_init_boot_cpu() was removed from the AP startup code. This was a
  SMP-only hack that made sure mtrr_init() happened when SMP was
  enabled.  That's right - two different code paths to do the same
  thing, obscured by compile-time defines.

The appended patch makes sure mtrr_init() is called before
smp_prepare_cpus(). It's ugly, and I'll work on a cleaner solution, but
James: could you try it and see if it fixes your performance issues?


[PATCH] : Grammatical fixes · 4b84bbe0

Juan Quintela authored Sep 09, 2002

  Documentation/porting: s/are/and/
  Documentation/directory-locking: s/that means// was repeated


[PATCH] 2.5.34: recalc_sigpending missing for modules · 69be6c8e

Petr Vandrovec authored Sep 09, 2002

When recalc_sigpending was converted from inline to real function,
appropriate EXPORT_SYMBOL() was not created.  Needed at least for ncpfs
and lockd.


[PATCH] 2.5.34 kernel-api DocBook fix · 2f5d3153
Chris Wright authored Sep 09, 2002
```
Update kernel-api.tmpl to reflect mtrr changes so that the docs will build.
```
PCI Hotplug: remove pci_*_nodev() prototypes as the functions are gone. · 3d1a6602
Greg Kroah-Hartman authored Sep 09, 2002
```
The pci_bus_* functions should be used instead.
```
PCI: export pci_scan_bus() as the IBM PCI Hotplug driver needs it. · 5be7fa58
Greg Kroah-Hartman authored Sep 09, 2002

[PATCH] IBM PCI Hotplug driver update for PCI based controllers · ef7f120a
Irene Zubarev authored Sep 09, 2002

[PATCH] IBM PCI Hotplug driver update for ISA based controllers · 9adaeddf
Irene Zubarev authored Sep 09, 2002

[PATCH] IBM PCI Hotplug driver update · af6e9e07
Irene Zubarev authored Sep 09, 2002
```
- fix polling logic
- add ability to write [chassis/rxe]#slot# instead of just slot#
```
Merge groveronline.com:/root/bk/linux-2.5 · df9cf6c8
Andy Grover authored Sep 09, 2002
```
into groveronline.com:/root/bk/linux-acpi
```