Commits · dcdc0bd836f8cf2ed3160e8a7036e18ff73f7e98 · Kirill Smelkov / linux

30 Oct, 2002 40 commits

Port 0.8.50 acl-xattr patch to 2.5 (harmonize header file with SGI/XFS) · dcdc0bd8
Theodore Y. Ts'o authored Oct 30, 2002
```
This patch converts extended attributes passed in from user
space to a generic Posix ACL representation.
```

Port of 0.8.50 acl-ms-posixacl patch to 2.5 · 14183fd4

Theodore Y. Ts'o authored Oct 30, 2002

This patch (as well as the previous one) implements core ACL support
which is needed for XFS as well as ext2/3 ACL support. It causes umask
handling to be skipped for inodes that contain POSIX acl's, so that the
original mode information can be passed down to the low-level fs code,
which will take care of handling the umask.
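
A minimal sketch of the idea, not the kernel code: the function and parameter names below are hypothetical, and `dir_uses_posix_acls` stands in for whatever check the VFS performs. When ACLs are in play the requested mode is passed down unmasked; otherwise the umask is applied as before.

```c
/* Sketch only: names are hypothetical, not taken from the patch. */
static unsigned int initial_mode(unsigned int requested,
                                 unsigned int umask_bits,
                                 int dir_uses_posix_acls)
{
    if (dir_uses_posix_acls)
        return requested;           /* pass the original mode down untouched */
    return requested & ~umask_bits; /* classic behaviour: mask it here */
}
```

With a umask of 022, a request for 0666 becomes 0644 in the classic case and stays 0666 when the low-level filesystem is expected to handle the umask itself.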


Port of 0.8.50 acl patch to 2.5 · 762b1b86

Theodore Y. Ts'o authored Oct 30, 2002

This patch (as well as the following two) implements core ACL support.
This set of convenience functions is used by the ext2/3 filesystem,
and may be useful to other filesystems that wish to use "struct posix_acl"
as their internal representation of acl's. User mode tools which
support this interface may be found at http://acl.bestbits.at


Port of (bugfixed) 0.8.50 xattr-ext2 to 2.5 (w/ hch cleanups. mbcache API) · 8603affb

Theodore Y. Ts'o authored Oct 30, 2002

This patch adds extended attribute support to the ext2 filesystem. This
uses the generic extended attribute patch which was developed by Andreas
Gruenbacher and the XFS team. As a result, the user space utilities
which work for XFS will also work with these patches.


Port of (bugfixed) 0.8.50 xattr-ext3 to 2.5 (w/ hch cleanups. mbcache API) · f7cfad91

Theodore Y. Ts'o authored Oct 30, 2002

This patch adds extended attribute support to the ext3 filesystem. This
uses the generic extended attribute patch which was developed by Andreas
Gruenbacher and the XFS team. As a result, the user space utilities
which work for XFS will also work with these patches.


Port of the 0.8.50 xattr-mbcache patch to 2.5. (Shrinker API, hch cleanups) · 7cbc2add

Theodore Y. Ts'o authored Oct 30, 2002

(now uses struct block_device * to index devices, and uses hash.h for hash function)

This patch creates a meta block cache which is utilized by the ext3 and
ext2 extended attribute patch (patches 2 and 3, respectively).  This
cache allows directory blocks to be indexed by multiple keys.  In the
case of the extended attribute patches, it is used to look up blocks by
both the block number and by the hash of the extended attributes.  This
is extremely important to allow the sharing of acl's when stored as
extended attributes.  Otherwise every single file would require its own,
separate, one block overhead to store the ACL, even though there might
be a large number of files that have the same ACL.
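
The two-key lookup and the resulting sharing can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the mbcache API: all names are hypothetical, the cache is a small fixed array scanned linearly, and a hash match is trusted without comparing block contents.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct cache_entry {
    int      in_use;
    uint64_t block;      /* key 1: block number on disk */
    uint32_t attr_hash;  /* key 2: hash of the attribute contents */
    unsigned refcount;   /* how many inodes share this block */
};

static struct cache_entry cache[CACHE_SLOTS];

static struct cache_entry *find_by_block(uint64_t block)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].in_use && cache[i].block == block)
            return &cache[i];
    return NULL;
}

static struct cache_entry *find_by_hash(uint32_t attr_hash)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].in_use && cache[i].attr_hash == attr_hash)
            return &cache[i];
    return NULL;
}

/* Returns the block that should hold attributes with this hash:
 * an already cached block if one exists, otherwise new_block. */
static uint64_t attach_attr_block(uint32_t attr_hash, uint64_t new_block)
{
    struct cache_entry *e = find_by_hash(attr_hash);

    if (e) {             /* a real implementation would also compare contents */
        e->refcount++;
        return e->block;
    }
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].in_use) {
            cache[i] = (struct cache_entry){ 1, new_block, attr_hash, 1 };
            break;
        }
    }
    return new_block;    /* cached or not, the caller uses its own block */
}
```

Two files whose attributes hash to the same value end up pointing at one block with a reference count of 2, which is the sharing the message describes.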


Ext2/3 forward compatibility: inode size · 216114b9

Theodore Y. Ts'o authored Oct 30, 2002

This patch allows filesystems with expanded inodes to be mounted.
(compatibility feature flags will be used to control whether or not the
filesystem should be mounted in case the new inode fields will result in
compatibility issues).  This allows for future compatibility with newer
versions of ext2fs.


Ext2/3 forward compatibility: on-line resizing · 1142d28b

Theodore Y. Ts'o authored Oct 30, 2002

This patch allows forward compatibility with future filesystems which
are dynamically grown by using an alternate algorithm for storing the
block group descriptors.  It's also a bit more efficient, in that it
uses just a little bit less disk space.  Currently, the ext2 filesystem
format requires either relocating the inode table, or reserving space
before doing the on-line resize.  The new scheme is documented in
"Planned Extensions to the Ext2/3 Filesystem", by Stephen Tweedie and me
(see: http://e2fsprogs.sourceforge.net/extensions-ext23).


Default mount options from superblock for ext2/3 filesystems · 841d9227

Theodore Y. Ts'o authored Oct 30, 2002

This patch adds support for default mount options to be stored in the
superblock, so they don't have to be specified on the mount command line
(or in /etc/fstab).  While I was in the code, I also cleaned up the
handling of how mount options are processed in the ext2 and ext3
filesystems.

Most mount options are now processed *after* the superblock has been
read in.  This allows for a much cleaner handling of those default mount
option parameters that were already stored in the superblock: the
resuid, resgid, and s_errors fields were handled using some fairly gross
special cases.  Now the only mount option which is processed first is
the sb option, which specifies the location of the superblock.  This
allows the handling of all of the default mount parameters to be much
more cleanly and more generally handled.

This does change the behaviour from earlier kernels, in that if the sb
mount option is specified, it must be specified *first*.  However, this
option is rarely used, and if it is, it generally is specified first, so
this seems to be a reasonable restriction.
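
The ordering described above (superblock defaults first, command-line options applied on top) can be sketched like this. It is a simplified illustration: the flag names and the function are hypothetical, and only three option strings are recognised.

```c
#include <string.h>

#define OPT_DEBUG 0x1u   /* hypothetical flag bits */
#define OPT_GRPID 0x2u

/* Start from the defaults stored in the superblock, then let each
 * comma-separated command-line option override them in order. */
static unsigned effective_opts(unsigned sb_defaults, const char *cmdline)
{
    unsigned opts = sb_defaults;
    char buf[128];
    char *tok;

    strncpy(buf, cmdline, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "debug") == 0)
            opts |= OPT_DEBUG;
        else if (strcmp(tok, "grpid") == 0)
            opts |= OPT_GRPID;
        else if (strcmp(tok, "nogrpid") == 0)
            opts &= ~OPT_GRPID;
        /* unknown options are ignored in this sketch */
    }
    return opts;
}
```

Because the defaults are already in place before parsing starts, an explicit `nogrpid` on the command line can switch off a default that the superblock turned on.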


Linux v2.5.45. For real this time. · b1b782f7
Linus Torvalds authored Oct 30, 2002

Merge master.kernel.org:/home/davem/BK/net-2.5 · dc85a09d
Linus Torvalds authored Oct 30, 2002
```
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
```

[PATCH] kNFSd: Convert nfsd to use a list of pages instead of one big buffer · a0e7d495

Neil Brown authored Oct 30, 2002

This means:
  1/ We don't need an order-4 allocation for each nfsd that starts
  2/ We don't need an order-4 allocation in skb_linearize when
     we receive a 32K write request
  3/ It will be easier to incorporate the zero-copy read changes

The pages are handed around using an xdr_buf (instead of svc_buf)
much like the NFS client so future crypto code can use the same
data structure for both client and server.

The code assumes that most requests and replies fit in a single page.
The exceptions are assumed to have some largish 'data' bit, and the
rest must fit in a single page.
The 'data' bits are file data, readdir data, and symlinks.
There must be only one 'data' bit per request.
This is all fine for nfs/nlm.
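
For scale, a small arithmetic sketch, assuming 4 KiB pages (the page size is an assumption, and the helper names are hypothetical): an order-4 allocation is 2^4 physically contiguous pages, whereas a 32K payload spread over a page list needs eight pages that do not have to be adjacent.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Bytes covered by one contiguous allocation of the given order. */
static size_t order_to_bytes(unsigned order)
{
    return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Number of independent pages needed to hold len bytes of data. */
static size_t pages_for(size_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```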

This isn't complete:
  1/ NFSv4 hasn't been converted yet (it won't compile)
  2/ NFSv3 allows symlinks up to 4096, but the code will only support
     up to about 3800 at the moment
  3/ readdir responses are limited to about 3800.

but I thought that patch was big enough, and the rest can come
later.


This patch introduces vfs_readv and vfs_writev as parallels to
vfs_read and vfs_write.  This means there is a fair bit of
duplication in read_write.c that should probably be tidied up...


[PATCH] kNFSd: nfsd_readdir changes. · 335c5fc7

Neil Brown authored Oct 30, 2002

nfsd_readdir - the common readdir code for all versions of nfsd,
contains a number of version-specific things with appropriate checks,
and also does some xdr-encoding which rightly belongs elsewhere.

This patch simplifies nfsd_readdir to do just the core stuff, and moves
the version specifics into version specific files, and the xdr encoding
into xdr encoding files.


[PATCH] kNFSd: Fix problem with buffer length with rpc/tcp · f319e5fa

Neil Brown authored Oct 30, 2002

I forgot to add '1' for the record-length header in RPC/TCP.
 Thanks to  Hirokazu Takahashi <taka@valinux.co.jp>


[PATCH] kNFSd: Make sure export_open cleans up on failure. · 988d8f66

Neil Brown authored Oct 30, 2002

Currently if the kmalloc in exports_open fails,
the seq_file isn't seq_released.

We now do the kmalloc first, and make sure to kfree
if seq_open fails.
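
The reordering can be sketched in user space as below. This is an illustration only: `fake_seq_open()` is a hypothetical stand-in for `seq_open()`, and `malloc`/`free` stand in for `kmalloc`/`kfree`.

```c
#include <stdlib.h>

struct fake_file {
    void *private_data;
    int   opened;
};

/* Hypothetical stand-in for seq_open(); fails on request. */
static int fake_seq_open(struct fake_file *f, int should_fail)
{
    if (should_fail)
        return -1;
    f->opened = 1;
    return 0;
}

static int sketch_exports_open(struct fake_file *f, int open_fails)
{
    void *state = malloc(64);   /* allocate first ... */

    if (!state)
        return -1;              /* nothing has been opened, nothing to undo */

    if (fake_seq_open(f, open_fails) != 0) {
        free(state);            /* ... and free it if the open step fails */
        return -1;
    }
    f->private_data = state;
    return 0;
}
```

Doing the allocation first means every failure path has at most one thing to undo.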


[PATCH] kNFSd: Fix nfs shutdown problem. · b9d189e5

Neil Brown authored Oct 30, 2002

The 'unexport everything' that happens when the
last nfsd thread dies was shutting down too much -
things that should only be shut down on module unload.


[PATCH] Remove sole CONFIG_MULTIQUAD in kernel source · 23518c21

Matthew Dobson authored Oct 30, 2002

There is one remaining instance of CONFIG_MULTIQUAD in the kernel source.

Fix it to use the proper CONFIG_X86_NUMAQ instead.


[PATCH] md: factor out MD superblock handling code · d571b483

Neil Brown authored Oct 30, 2002

Define an interface for interpreting and updating superblocks
so we can more easily define new formats.

With this patch, (almost) all superblock layout information is
located in a small set of routines dedicated to superblock
handling.  This will allow us to provide a similar set for
a different format.
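
One common way to structure such an interface is a table of function pointers per format. The sketch below assumes that shape; the struct, the toy one-byte layout, and all names are hypothetical and are not taken from the md code.

```c
#include <stddef.h>
#include <stdint.h>

struct array_info {
    uint32_t raid_disks;
};

/* Everything that knows the on-disk layout sits behind this table. */
struct sb_format {
    const char *name;
    int  (*load)(const uint8_t *raw, size_t len, struct array_info *out);
    void (*store)(const struct array_info *in, uint8_t *raw, size_t len);
};

/* Toy layout: the disk count lives in byte 0. */
static int toy_load(const uint8_t *raw, size_t len, struct array_info *out)
{
    if (len < 1)
        return -1;
    out->raid_disks = raw[0];
    return 0;
}

static void toy_store(const struct array_info *in, uint8_t *raw, size_t len)
{
    if (len >= 1)
        raw[0] = (uint8_t)in->raid_disks;
}

static const struct sb_format toy_format = { "toy-v0", toy_load, toy_store };

/* Generic code never touches raw bytes directly. */
static int add_disk(const struct sb_format *fmt, uint8_t *raw, size_t len)
{
    struct array_info info;

    if (fmt->load(raw, len, &info) != 0)
        return -1;
    info.raid_disks++;
    fmt->store(&info, raw, len);
    return 0;
}
```

Supporting a second format then means supplying another `struct sb_format` instance; `add_disk()` stays unchanged.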

The two exceptions are:
 1/ autostart_array where the devices listed in the superblock
    are searched for.
 2/ raid5 'knows' the maximum number of devices for
     compute_parity.

These will be addressed in a later patch.


Merge · 6932d2d5
Linus Torvalds authored Oct 29, 2002


[PATCH] x86-64 updates for 2.5.44 · d05e5732

Andi Kleen authored Oct 29, 2002

A few updates for x86-64 in 2.5.44. Some of the bugs fixed were serious.

- Don't count ACPI mappings in end_pfn. This shrinks mem_map a lot
  on many setups.
- Fix mem= option. Remove custom mapping support.
- Revert per_cpu implementation to the generic version. The optimized one
  that used %gs directly triggered too many toolkit problems and was a
  constant source of bugs.
- Make sure pgd_offset_k works correctly for vmalloc mappings. This makes
  modules work again properly.
- Export pci dma symbols
- Export other symbols to make more modules work
- Don't drop physical address bits >32bit on iommu free.
- Add more prototypes to fix warnings
- Resync pci subsystem with i386
- Fix pci dma kernel option parsing.
- Do PCI peer bus scanning after ACPI in case it missed some busses
  (that's a workaround - 2.5 ACPI seems to have some problems here that
  I need to investigate more closely)
- Remove the .eh_frame on linking. This saves several hundred KB in the
  bzImage
- Fix MTRR initialization. It works properly now on SMP again.
- Fix kernel option parsing, it was broken by section name changes in
  init.h
- A few other cleanups and fixes.
- Fix nonatomic warning in ioport.c


[PATCH] hot-n-cold pages: free and allocate hints · 8d6282a1

Andrew Morton authored Oct 29, 2002

Add a `cold' hint to struct pagevec, and teach truncate and page
reclaim to use it.

Empirical testing showed that truncate's pages tend to be hot.  And page
reclaim's are certainly cold.


[PATCH] hot-n-cold pages: use cold pages for readahead · 5019ce29

Andrew Morton authored Oct 29, 2002

It is usually the case that pagecache reads use busmastering hardware
to transfer the data into pagecache.  This invalidates the CPU cache of
the pagecache pages.

So use cache-cold pages for pagecache reads, to avoid wasting
cache-hot pages.


[PATCH] hot-n-cold pages: page allocator core · a206231b
Andrew Morton authored Oct 29, 2002
```
Hot/Cold pages and zone->lock amortisation
```

[PATCH] hot-n-cold pages: bulk page freeing · 1d2652dd

Andrew Morton authored Oct 29, 2002

Patch from Martin Bligh.

Implements __free_pages_bulk().  Release multiple pages of a given
order into the buddy all within a single acquisition of the zone lock.

This also removes current->local_pages, the per-task list of pages
which only ever contained one page.  It existed to prevent other tasks
from stealing pages which this task has just freed up.

Given that we're freeing into the per-cpu caches, and that those are
multipage caches, and the cpu-stickiness of the scheduler, I think
current->local_pages is no longer needed.


[PATCH] hot-n-cold pages: bulk page allocator · 38e419f5

Andrew Morton authored Oct 29, 2002

This is the hot-n-cold-pages series.  It introduces a per-cpu lockless
LIFO pool in front of the page allocator, for three reasons:

1: To reduce lock contention on the buddy lock: we allocate and free
   pages in, typically, 16-page chunks.

2: To return cache-warm pages to page allocation requests.

3: As infrastructure for a page reservation API which can be used to
   ensure that the GFP_ATOMIC radix-tree node and pte_chain allocations
   cannot fail.  That code is not complete, and does not absolutely
   require hot-n-cold pages.  It'll work OK though.

We add two queues per CPU.  The "hot" queue contains pages which the
freeing code thought were likely to be cache-hot.  By default, new
allocations are satisfied from this queue.

The "cold" queue contains pages which the freeing code expected to be
cache-cold.  The cold queue is mainly for lock amortisation, although
it is possible to explicitly allocate cold pages.  The readahead code
does that.
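
A rough single-CPU sketch of the hot queue, under simplifying assumptions: pages are plain integers, the zone lock is only a counter of simulated acquisitions, the cold queue is left out, and every name is hypothetical. It shows the two properties described above: LIFO reuse of recently freed pages, and one trip to the shared allocator per 16-page batch.

```c
#include <stddef.h>
#include <string.h>

#define BATCH     16   /* pages moved per trip to the shared allocator */
#define QUEUE_MAX 64

struct pcp_queue {
    int    pages[QUEUE_MAX];
    size_t count;
};

static unsigned long zone_lock_taken;  /* simulated lock acquisitions */
static int next_page;                  /* stand-in for the buddy lists */

static void refill(struct pcp_queue *q)
{
    zone_lock_taken++;                 /* one lock round-trip ... */
    for (int i = 0; i < BATCH; i++)    /* ... fetches a whole batch */
        q->pages[q->count++] = next_page++;
}

static int alloc_page_sketch(struct pcp_queue *q)
{
    if (q->count == 0)
        refill(q);
    return q->pages[--q->count];       /* LIFO: newest entry first */
}

static void free_page_sketch(struct pcp_queue *q, int page)
{
    if (q->count == QUEUE_MAX) {       /* full: hand the oldest batch back */
        zone_lock_taken++;
        memmove(q->pages, q->pages + BATCH,
                (QUEUE_MAX - BATCH) * sizeof(q->pages[0]));
        q->count -= BATCH;
    }
    q->pages[q->count++] = page;       /* likely cache-hot, reuse it next */
}
```

In this model sixteen allocations cost one lock acquisition instead of sixteen, and a page freed a moment ago is the first one handed out again.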

I have been hot and cold on these patches for quite some time - the
benefit is not great.

- 4% speedup in Randy Hron's benching of the autoconf regression
  tests on a 4-way.  Most of this came from savings in pte_alloc and
  pmd_alloc: the pagetable clearing code liked the warmer pages (some
  architectures still have the pgt_cache, and can perhaps do away with
  them).

- 1% to 2% speedup in kernel compiles on my 4-way and Martin's 32-way.

- 60% speedup in a little test program which writes 80 kbytes to a
  file and ftruncates it to zero again.  Ran four instances of that on
  4-way and it loved the cache warmth.

- 2.5% speedup in Specweb testing on 8-way

- The thing which won me over: an 11% increase in throughput of the
  SDET benchmark on an 8-way PIII:

	with hot & cold:

	RESULT for 8 users is 17971    +12.1%
	RESULT for 16 users is 17026   +12.0%
	RESULT for 32 users is 17009   +10.4%
	RESULT for 64 users is 16911   +10.3%

	without:

	RESULT for 8 users is 16038
	RESULT for 16 users is 15200
	RESULT for 32 users is 15406
	RESULT for 64 users is 15331

  SDET is a very old SPEC test which simulates a development
  environment with a large number of users.  Lots of users running a
  mix of shell commands, basically.
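
The percentages in the SDET table follow directly from the paired results. A small helper (the name is hypothetical) makes the arithmetic explicit:

```c
/* Relative gain of `with` over `without`, in percent. */
static double gain_percent(double with, double without)
{
    return (with / without - 1.0) * 100.0;
}
```

For the 8-user row, `gain_percent(17971, 16038)` comes to a little over 12%, and the four rows average out to roughly 11%, which matches the figure quoted above.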


These patches were written by Martin Bligh and myself.

This one implements rmqueue_bulk() - a function for removing multiple
pages of a given order from the buddy lists.

This is for lock amortisation: take the highly-contended zone->lock
with less frequency, do more work once it has been acquired.


[PATCH] percpu: convert global page accounting · afce7191

Andrew Morton authored Oct 29, 2002

Convert global page state accounting to use per-cpu storage

(I think this code remains a little buggy, btw.  Note how I do

	per_cpu(page_states, cpu).member += (delta);

This gets done at interrupt time and hence is assuming that
the "+=" operation on a ulong is atomic wrt interrupts on
all architectures. How do we feel about that assumption?)


[PATCH] percpu: create an EXPORT_PER_CPU_SYMBOL() macro · 999eac41
Andrew Morton authored Oct 29, 2002
```
This is needed so that per-cpu information in the core kernel can be
accessed from modules.
```

[PATCH] percpu: convert buffer.c · e252fb96

Andrew Morton authored Oct 29, 2002

Patch from Dipankar Sarma <dipankar@in.ibm.com>

This patch makes per_cpu bh_accounting safe for cpu_possible
allocation by using cpu notifiers.


[PATCH] percpu: convert softirqs · c1bf37e9

Andrew Morton authored Oct 29, 2002

Patch from Dipankar Sarma <dipankar@in.ibm.com>

This patch makes per_cpu tasklet vectors safe for cpu_possible
allocation by using CPU notifiers.


[PATCH] percpu: convert timers · cf228cdc

Andrew Morton authored Oct 29, 2002

Patch from Dipankar Sarma <dipankar@in.ibm.com>

This patch changes the per-CPU data in timer management (tvec_bases)
to use per_cpu data area and makes it safe for cpu_possible allocation
by using CPU notifiers. End result - saving space.

Depends on cpu_possible patch.


[PATCH] percpu: convert RCU · c12e16e2

Andrew Morton authored Oct 29, 2002

Patch from Dipankar Sarma <dipankar@in.ibm.com>

This patch converts RCU per_cpu data to use per_cpu data area
and makes it safe for cpu_possible allocation by using CPU
notifiers.


[PATCH] percpu: fix compile warning for UP builds · 0c83f291

Andrew Morton authored Oct 29, 2002

A typical construct is:

	int cpu = get_cpu();

	foo = per_cpu(bar, cpu);
	put_cpu();

but this generates a compiler warning on uniprocessor builds: unused
variable `cpu'.

Add a dummy ref to `cpu' to per_cpu() to prevent this.
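
A sketch of how such a dummy reference can look in a uniprocessor-style macro. The `_sketch` names are hypothetical, and the exact form used by the patch may differ: the `cpu` argument is evaluated and discarded through a `(void)` cast, so the compiler counts the variable as used.

```c
/* On a uniprocessor build there is only one copy of each variable,
 * so the cpu argument is not needed to find it. */
#define DEFINE_PER_CPU_SKETCH(type, name) type name##__per_cpu
#define per_cpu_sketch(var, cpu) (*((void)(cpu), &var##__per_cpu))

static DEFINE_PER_CPU_SKETCH(int, bar);

static int get_cpu_sketch(void)
{
    return 0;   /* the only CPU */
}

static int read_bar(void)
{
    int cpu = get_cpu_sketch();
    int foo = per_cpu_sketch(bar, cpu); /* `cpu` is referenced: no warning */

    return foo;
}
```

The comma expression yields a pointer that is then dereferenced, so `per_cpu_sketch(bar, cpu)` stays usable on the left-hand side of an assignment.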


[PATCH] percpu: balance_dirty_pages ratelimit counters · f98bf5ff
Andrew Morton authored Oct 29, 2002
```
Convert balance_dirty_pages_ratelimited() to use percpu storage
for the ratelimiting counters.
```
[UDP]: Delete buggy assertion. · 4c664ca5
Alexey Kuznetsov authored Oct 29, 2002


[PATCH] slab: Use CPU notifiers · 4524ea04

Andrew Morton authored Oct 29, 2002

- allocate memory for cpu buffers in cpu_up_prepare

- start the timer in cpu_online

- free the memory for cpu buffers in cpu_up_cancel.
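
The three steps above map naturally onto a callback that switches on the hotplug event. The sketch below is a user-space illustration with hypothetical event names and a hypothetical callback signature; `malloc`/`free` stand in for the kernel allocator and a flag stands in for the timer.

```c
#include <stdlib.h>

enum cpu_event { EV_UP_PREPARE, EV_ONLINE, EV_UP_CANCELED };

#define MAX_CPUS 4

static void *cpu_buf[MAX_CPUS];
static int   timer_running[MAX_CPUS];

static int slab_cpu_callback(enum cpu_event ev, unsigned cpu)
{
    if (cpu >= MAX_CPUS)
        return -1;

    switch (ev) {
    case EV_UP_PREPARE:               /* before the CPU runs: allocate */
        cpu_buf[cpu] = malloc(128);
        return cpu_buf[cpu] ? 0 : -1;
    case EV_ONLINE:                   /* CPU is up: start its timer */
        timer_running[cpu] = 1;
        return 0;
    case EV_UP_CANCELED:              /* bring-up failed: undo the prepare */
        free(cpu_buf[cpu]);
        cpu_buf[cpu] = NULL;
        return 0;
    }
    return -1;
}
```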


[PATCH] slab: additional code cleanup · b464df2e

Andrew Morton authored Oct 29, 2002

From Manfred Spraul

- remove all typedef, except the kmem_bufctl_t.  It's a redefine for
  an int, i.e.  qualifies as tiny.

- convert most macros to inline functions.


[PATCH] slab: Remove cache_chain_lock · 716b7ab1

Andrew Morton authored Oct 29, 2002

Manfred added a new lock to protect the global list of slab caches.  We
already have a semaphore for those but he needs locking from timer
context.

So here we remove that lock and just do a down_trylock() on the
existing semaphore.  If that fails give up - we'll try again next timer
tick.
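
The try-and-give-up pattern can be sketched in user space with a C11 `atomic_flag` standing in for the semaphore. All names are hypothetical, and the flag is only an approximation of `down_trylock()`: the point is that the timer path never blocks.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag cache_chain_busy = ATOMIC_FLAG_INIT;
static unsigned reap_runs, reap_skips;

/* Non-blocking acquire: true if we got the lock. */
static bool try_lock_chain(void)
{
    return !atomic_flag_test_and_set(&cache_chain_busy);
}

static void unlock_chain(void)
{
    atomic_flag_clear(&cache_chain_busy);
}

/* Called from the periodic timer; must not sleep. */
static void reap_timer_tick(void)
{
    if (!try_lock_chain()) {
        reap_skips++;        /* someone else holds it: retry next tick */
        return;
    }
    reap_runs++;             /* walk the cache list here */
    unlock_chain();
}
```

A skipped tick costs nothing but a little delay, since the same work is attempted again on the next one.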


[PATCH] slab: Rework the slab timer code to use add_timer_on · bf19f75e

Andrew Morton authored Oct 29, 2002

Manfred had all this weird code to schedule a kernel thread onto a
different CPU just so that we could bind a timer to that CPU.

Convert it all to use the new add_timer_on().


[PATCH] slab: reap timers · fd1425d5

Andrew Morton authored Oct 29, 2002

- add a reap timer that returns stale objects from the cpu arrays
- use list_for_each instead of while loops
- /proc/slabinfo layout change, for a new field about reaping.

Implementation:
slab contains 2 caches that contain objects that might be usable to the
system:
- the cpu arrays contain objects that other cpus could use
- the slabs_free list contains freeable slabs, i.e. pages that someone
else might want.

The patch now keeps track of accesses to the cpu arrays and to the free
list. If there were no recent activities in one of the caches, part of
the cache is flushed.

Unlike <2.5.39, only a small part (~20%) is flushed each time:
The older kernel would refill/drain bounce heavily under memory pressure:

- kmem_cache_alloc: notices that there are no objects in the cpu
        cache, loads 120 objects from the slab lists, return 1.
        [assuming batchcount=120]
- kmem_cache_reap is called due to memory pressure, finds 119
        objects in the cpu array and returns them to the slab lists.
- repeat.
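
A sketch of the partial flush, with hypothetical names and a guessed rounding rule (the patch only says "~20%"): an array that was touched since the last tick is left alone, and an idle one gives back about a fifth of its objects instead of all of them.

```c
struct cpu_array {
    unsigned avail;    /* objects currently cached for this cpu */
    int      touched;  /* set on every alloc/free, cleared by the timer */
};

/* Returns how many objects were handed back to the slab lists. */
static unsigned reap_tick(struct cpu_array *ac)
{
    unsigned tofree;

    if (ac->touched) {      /* recent activity: keep everything */
        ac->touched = 0;
        return 0;
    }
    tofree = (ac->avail + 4) / 5;   /* about 20%, rounded up */
    ac->avail -= tofree;
    return tofree;
}
```

With 119 cached objects, an idle tick returns 24 of them and keeps 95, so the next allocation does not have to reload a full batch.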

In addition, the length of the free list is limited based on the free
list accesses: a fixed "1" limit hurts the large object caches.

That's the last part for now, next is: [not yet written]
- cleanup: BUG_ON instead of if() BUG
- OOM handling for enable_cpucaches
- remove the unconditional might_sleep() from
        cache_alloc_debugcheck_before, and make that DEBUG dependent.
- initial NUMA support, just to collect some stats:
        Which percentage of the objects are freed on the wrong
        node? 0.1% or 20%?


[PATCH] slab: uninline poisoning checks · 1aabbecc
Andrew Morton authored Oct 29, 2002
```
remove inline from the cache poison checks: the functions are not
performance critical.
```