Commits · 07ebe427f0289a322c96c38906bb3bb7aacf15b6 · nexedi / linux

12 Apr, 2004 40 commits

[PATCH] set mod->waiter before calling stop_machine · 07ebe427

Andrew Morton authored Apr 11, 2004

From: Rusty Russell <rusty@rustcorp.com.au>

mod->waiter needs to be set before we try to stop the module: setting it in
__try_stop_module means it gets set to the kthread, not rmmod.

07ebe427

[PATCH] slab: updates for per-arch alignments · b9e55f3d

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Description:

Right now kmem_cache_create automatically decides about the alignment of
allocated objects. The automatic decisions are sometimes wrong:

- for some objects, it's better to keep them as small as possible to
  reduce the memory usage.  Ingo already added a parameter to
  kmem_cache_create for the sigqueue cache, but it wasn't implemented.

- for s390, normal kmalloc must be 8-byte aligned.  With debugging
  enabled, the default allocation was 4-bytes.  This means that s390 cannot
  enable slab debugging.

- arm26 needs 1 kB aligned objects.  Previously this was impossible to
  generate, therefore arm has its own allocator in
  arm26/machine/small_page.c

- most objects should be cache line aligned, to avoid false sharing.  But
  the cache line size was set at compile time, often to 128 bytes for
  generic kernels.  This wastes memory.  The new code uses the runtime
  determined cache line size instead.

- some caches want an explicit alignment.  One example is the pte_chain
  objects: they must find the start of the object with addr&mask.  Right
  now pte_chain objects are scaled to the cache line size, because that was
  the only alignment that could be generated reliably.

The implementation reuses the "offset" parameter of kmem_cache_create and
now uses it to pass in the requested alignment.  offset was ignored by the
current implementation, and the only user I found is sigqueue, which
intended to set the alignment.

In the long run, it might be interesting for the main tree: due to the 128
byte alignment, only 7 inodes fit into one page; with 64-byte alignment, 9
inodes fit - 20% memory recovered for Athlon systems.



For generic kernels  running on P6 cpus (i.e. 32 byte cachelines), it means

Number of objects per page:

 ext2_inode_cache: 8 instead of 7
 ext3_inode_cache: 8 instead of 7
 fat_inode_cache: 9 instead of 7
 rpc_tasks: 24 instead of 15
 tcp_tw_bucket: 40 instead of 30
 arp_cache: 40 instead of 30
 nfs_write_data: 9 instead of 7
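
The per-page counts above come from rounding each object up to the cache's
alignment and dividing the page by the result.  A minimal sketch of that
arithmetic, ignoring slab management overhead; align_up() and
objs_per_page() are hypothetical names, and the 420-byte object size used
below is made up for illustration, not taken from any real cache.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/* Objects that fit in one page, ignoring slab bookkeeping overhead. */
static size_t objs_per_page(size_t page_size, size_t obj_size, size_t align)
{
	return page_size / align_up(obj_size, align);
}
```

With a 4096-byte page, a hypothetical 420-byte object gives 8 objects per
page at 128-byte alignment and 9 at 64-byte alignment.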

b9e55f3d

[PATCH] Fix scripts/kernel-doc to handle __attribute__ · 1aa6c0d1

Andrew Morton authored Apr 11, 2004

From: Tom Rini <trini@kernel.crashing.org>

The following patch is needed so that kernel-doc can handle functions which
have __attribute__'s on them (such as __attribute__ ((weak))).

1aa6c0d1

[PATCH] readv/writev range checking fix · fb14ef35

Andrew Morton authored Apr 11, 2004

do_readv_writev() is trying to fail if

a) any of the segments have a length < 0 or

b) the sum of the segments wraps negative.

But it gets b) wrong because local variable tot_len is unsigned.

Fix that up.

fb14ef35

[PATCH] jbd: fix I/O error handling · b1ee3fea

Andrew Morton authored Apr 11, 2004

Fix a few buglets spotted by Jeff Mahoney <jeffm@suse.com>. We're currently
only checking for I/O errors against journal buffers if they were locked when
they were first inspected.

We need to check buffer_uptodate() even if the buffers were already unlocked.

b1ee3fea

[PATCH] JBD: ordered-data commit cleanup · 2b38960c

Andrew Morton authored Apr 11, 2004

For data=ordered, kjournald at commit time has to write out and wait upon a
long list of buffers.  It does this in a rather awkward way with a single
list.  It causes complexity and long lock hold times, and makes the addition
of rescheduling points quite hard.

So what we do instead (based on Chris Mason's suggestion) is to add a new
buffer list (t_locked_list) to the journal.  It contains buffers which have
been placed under I/O.

So as we walk the t_sync_datalist list we move buffers over to t_locked_list
as they are written out.

When t_sync_datalist is empty we may then walk t_locked_list waiting for the
I/O to complete.

As a side-effect this means that we can remove the nasty synchronous wait in
journal_dirty_data which is there to avoid the kjournald livelock which would
otherwise occur when someone is continuously dirtying a buffer.
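
The two-list walk described above can be sketched in user space as follows.
This is only an illustration: struct buf, struct txn, submit_all() and
wait_all() are hypothetical stand-ins, with no locking and no real I/O.

```c
#include <stddef.h>

/* Hypothetical stand-in for a journalled data buffer. */
struct buf {
	struct buf *next;
	int under_io;
};

/* Hypothetical stand-in for the two per-transaction lists. */
struct txn {
	struct buf *sync_datalist;	/* buffers still to be written */
	struct buf *locked_list;	/* buffers placed under I/O */
};

/* Phase 1: start I/O on each buffer and move it to the locked list. */
static void submit_all(struct txn *t)
{
	while (t->sync_datalist) {
		struct buf *b = t->sync_datalist;

		t->sync_datalist = b->next;
		b->under_io = 1;		/* stands in for submitting the write */
		b->next = t->locked_list;
		t->locked_list = b;
	}
}

/* Phase 2: wait for each buffer on the locked list; returns how many. */
static int wait_all(struct txn *t)
{
	int waited = 0;

	while (t->locked_list) {
		struct buf *b = t->locked_list;

		t->locked_list = b->next;
		b->next = NULL;
		b->under_io = 0;		/* stands in for waiting on the I/O */
		waited++;
	}
	return waited;
}
```

Because a buffer leaves t_sync_datalist as soon as it is submitted, the
first loop always makes progress, and each phase is a natural place for a
rescheduling point.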

2b38960c

[PATCH] jbd: fix ordered-data writeout logic · 376fd482

Andrew Morton authored Apr 11, 2004

There's some nasty code in commit which deals with a lock ranking problem.
Currently if it fails to get the lock and local variable `bufs' is zero
we forget to write out some ordered-data buffers. So a subsequent
crash+recovery could yield stale data in existing files.

Fix it by correctly restarting the t_sync_datalist search.

376fd482

[PATCH] speed up ext2 fsync() and fdatasync() · 7176142a

Andrew Morton authored Apr 11, 2004

ext2_sync_file() forgets to clear the inode's dirty bits, so we write the
inode on every fsync(), even if it hasn't changed.

Fix that up via the new sync_file() API which correctly manages the inode
state bits and the superblock inode lists.

When performing file overwrite on IDE with and without writeback caching
enabled this patch approximately doubles fsync() speed, bringing it into line
with O_SYNC writes.

Also, fix up the return value handling in ext2_sync_file().

Credit due to Jeffrey Siegal <jbs@quiotix.com> who noticed the performance
discrepancy and wrote a test app.

7176142a

[PATCH] ext3 fsync() and fdatasync() speedup · a1ff5989

Andrew Morton authored Apr 11, 2004

ext3's fsync/fdatasync implementation is currently syncing the inode via a
full journal commit even if it was unaltered.

Fix that up by exporting the core VFS's inode sync function to modules and
calling it if the inode is dirty. We need to do it this way so that the
inode is moved to the appropriate superblock list and so that the i_state
dirty flags are appropriately updated.

This speeds up ext3 fsync() for file overwrites by a factor of four (disk
non-writeback) to forty (disk in writeback mode).

a1ff5989

[PATCH] Fix page allocator lower zone protection for NUMA · af70f767

Andrew Morton authored Apr 11, 2004

From: Martin Hicks <mort@wildopensource.com>

This changes __alloc_pages() so it uses precalculated values for the "min".
This should prevent the problem of min incrementing from zone to zone across
many nodes on a NUMA machine.  The result of falling back to other nodes with
the old incremental min calculations was that the min value became very
large.

af70f767

[PATCH] move job control fields from task_struct to signal_struct · 7860b371

Andrew Morton authored Apr 11, 2004

From: Roland McGrath <roland@redhat.com>

This patch moves all the fields relating to job control from task_struct to
signal_struct, so that all this info is properly per-process rather than
being per-thread.

7860b371

[PATCH] IPMI driver updates · 0ab2d668

Andrew Morton authored Apr 11, 2004

From: Corey Minyard <minyard@acm.org>

- Add support for messaging through an IPMI LAN interface, which is
  required for some system software that already exists on other IPMI
  drivers.  It also does some renaming and a lot of little cleanups.

- Add the "System Interface" driver.  The previous driver for system
  interfaces only supported the KCS interface, this driver supports all
  system interfaces defined in the IPMI standard.  It also does a much better
  job of handling ACPI and SMBIOS tables for detecting IPMI system
  interfaces.

0ab2d668

[PATCH] compat emulation for posix message queues · 87c22e84

Andrew Morton authored Apr 11, 2004

From: Arnd Bergmann <arnd@arndb.de>

I have tested the code with the open posix test suite and found the same
four failures for both 64-bit and compat mode; most tests pass. The patch
is against -mc1, but I guess it also applies to the other trees around.

What worries me more than mq_attr compatibility is the conversion of struct
sigevent, which might turn out really hard when more fields in there are
used. AFAICS, the only other part in the kernel ABI is sys_timer_create(),
so maybe it's not too late to deprecate the current structure and create a
structure that can be used properly for compat syscalls.

87c22e84

[PATCH] posix message queues: send notifications via netlink · 34b98f22

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

SIGEV_THREAD means that a given callback should be called in the context of a
new thread. This must be done by the C library. The kernel must deliver a
notice of the event to the C library when the callback should be called.

This patch switches to a new, simpler interface: User space creates a socket
with socket(PF_NETLINK, SOCK_RAW,0) and passes the fd to the mq_notify call
together with a cookie. When the mq_notify() condition is satisfied, the
kernel "writes" the cookie to the socket. User space then reads the cookie
and calls the appropriate callback.
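
The user-space half of this scheme (read a cookie, look up the matching
callback, call it) might look roughly like the sketch below.  Everything
here is an assumption made for illustration: the cookie length, the table
layout and all names are hypothetical, and a real netlink read would also
have to deal with message framing that this sketch ignores.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define COOKIE_LEN 8	/* hypothetical cookie size */

typedef void (*notify_fn)(void *arg);

/* One registered notification: the cookie given to mq_notify and its callback. */
struct notify_entry {
	unsigned char cookie[COOKIE_LEN];
	notify_fn fn;
	void *arg;
};

/* Example callback: counts how often it was invoked. */
static void count_hit(void *arg)
{
	++*(int *)arg;
}

/* Read one cookie from fd and run the matching callback; -1 if none matches. */
static int dispatch_one(int fd, const struct notify_entry *table, size_t n)
{
	unsigned char cookie[COOKIE_LEN];
	ssize_t got = read(fd, cookie, sizeof cookie);

	if (got != (ssize_t)sizeof cookie)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (memcmp(cookie, table[i].cookie, COOKIE_LEN) == 0) {
			table[i].fn(table[i].arg);
			return 0;
		}
	}
	return -1;
}
```

In the C library the callback would be run on a newly created thread rather
than called directly as it is here.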

34b98f22

[PATCH] split netlink_unicast · ed6dcf4a

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

The attached patch splits netlink_unicast into three steps:

- netlink_getsock{bypid,byfilp}: lookup the destination socket.

- netlink_attachskb: perform the nonblock checks, sleep if the socket
  queue is longer than the limit, etc.

- netlink_sendskb: actually send the skb.

jamal looked over it and didn't see a problem with the netlink change.  The
actual use from ipc/mqueue.c is still open (just send back whatever the C
library passed to mq_notify, add an nlmsghdr or perhaps even make it a
specialized netlink protocol), but the attached patch is independent from
the message queue change.

(acked by davem)

ed6dcf4a

[PATCH] security bugfix for mqueue · b06d7b4c

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

I found a security bug in the new mqueue code: a process that has only
write permissions to a message queue could call mq_notify(SIGEV_THREAD) and
use the returned notification file descriptor to read from the message
queue.

b06d7b4c

[PATCH] posix message queue update · f3ca8d5d

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

My discussion with Ulrich had one result:

- mq_setattr can accept implementation defined flags.  Right now we have
  none, but we might add some later (e.g.  switch to CLOCK_MONOTONIC for
  mq_timed{send,receive} or something similar).  When we add flags, we
  might need the fields for additional information.  And they don't hurt.
  Therefore add four __reserved fields to mq_attr.

- fail mq_setattr if we get unknown flags - otherwise glibc can't detect
  if it's running on a future kernel that supports new features.

- use memset to initialize the mq_attr structure - theoretically we could
  leak kernel memory.

- Only set O_NONBLOCK in mq_attr, explicitly clear O_RDWR & friends.
  openposix uses getattr, attr |=O_NONBLOCK, setattr - a sane approach. 
  Without clearing O_RDWR, this fails.

I've retested all openposix conformance tests with the new patch - the two
new FAILED tests check undefined behavior.  Note that I won't have net
access until Sunday - if the message queue patch breaks something important
either ask Krzysztof or drop it.

Ulrich had another good idea for SIGEV_THREAD, but I must think about it.
It would mean less complexity in glibc, but more code in the kernel.  I'm
not yet convinced that it's overall better.

f3ca8d5d

[PATCH] posix message queues: made user mountable · b95db642

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Make the posix message queue mountable by the user.  This replaces ipcs and
ipcrm for posix message queue: The admin can check which queues exist with ls
and remove stale queues with rm.

I'd like a final confirmation from Ulrich that our SIGEV_THREAD approach is
the right thing(tm): He's aware of the design and didn't object, but I think
he hasn't seen the final API yet.

b95db642

[PATCH] posix message queues: linux-specific poll extension · 0301b50b

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Linux specific extension: make the message queue identifiers pollable.  It's
simple and could be useful.

0301b50b

[PATCH] posix message queues: implementation · be94d44e

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Actual implementation of the posix message queues, written by Krzysztof
Benedyczak and Michal Wronski.  The complete implementation is dependent on
CONFIG_POSIX_MQUEUE.

It passed the openposix test suite with two exceptions: one mq_unlink test
was bad and tested undefined behavior.  And Linux succeeds
mq_close(open(,,,)).  The spec mandates EBADF, but we have decided to ignore
that: we would have to add a new syscall just for the right error code.

The patch intentionally doesn't use all helpers from fs/libfs for kernel-only
filesystems: step 5 allows user space mounts of the file system.



Signal changes:

The patch redefines SI_MESGQ using __SI_CODE: The generic Linux ABI uses
a negative value (i.e.  from user) for SI_MESGQ, but the kernel internal
value must be positive to pass check_kill_value.  Additionally, the patch
adds support into copy_siginfo_to_user to copy the "new" signal type to
user space.



Changes in signal code caused by POSIX message queues patch:

General & rationale:

  Signals generated by mqueues (only upon notification) must have si_code
  == SI_MESGQ.  In fact such a signal is sent from one process which
  caused notification (== sent message to empty message queue) to
  another which requested it.  Both processes can of course be unrelated
  in terms of uids/euids.  So SI_MESGQ signals must be classified as
  SI_FROMKERNEL to pass check_kill_permissions (needless to say, these
  signals ARE from the kernel).

  Signals generated by message queues notification need the same
  fields in siginfo struct's union _sifields as POSIX.1b signals and we
  can reuse its union entry.

  SI_MESGQ was previously defined to -3 in kernel and also in glibc. 
  So in userspace SI_MESGQ must be still visible as -3.

Solution:

  SI_MESGQ is defined in the same style as SI_TIMER using __SI_CODE macro.

  Details:

    Fortunately copy_siginfo_to_user copies si_code as short.  So we
    can use remaining part of int value freely.  __SI_CODE does the
    work.  SI_MESGQ is in kernel:

 		6<<16 | (-3 & 0xffff) which is > 0

    but to userspace is copied

 		(short) SI_MESGQ == -3
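
The arithmetic above can be checked in user space with a small sketch.  The
DEMO_ names are hypothetical stand-ins for __SI_MESGQ, __SI_CODE and
SI_MESGQ, chosen so that the sketch does not use reserved identifiers; the
values follow the text above.

```c
/* Hypothetical stand-ins for the kernel macros described above. */
#define DEMO_SI_MESGQ_PREFIX	(6 << 16)
#define DEMO_SI_CODE(T, N)	((T) | ((N) & 0xffff))
#define DEMO_SI_MESGQ		DEMO_SI_CODE(DEMO_SI_MESGQ_PREFIX, -3)

/* Value as seen inside the kernel: prefix in the high bits, so positive. */
static int kernel_value(void)
{
	return DEMO_SI_MESGQ;
}

/*
 * Value as seen by user space after si_code is copied as a short.
 * Narrowing an out-of-range int to short is implementation-defined in C;
 * common compilers keep the low 16 bits, which gives -3 here.
 */
static int user_value(void)
{
	return (short)DEMO_SI_MESGQ;
}
```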

Actual changes:

  Changes in include/asm-generic/siginfo.h

  __SI_MESGQ added in signal.h to represent inside-kernel prefix of
  SI_MESGQ.  SI_MESGQ is redefined from -3 to __SI_CODE(__SI_MESGQ, -3)

  Except on the mips architecture, these changes should be arch independent
  (asm-generic/siginfo.h is included in arch versions).  On mips
  SI_MESGQ is redefined to -4 in order to be compatible with IRIX.  But
  the same scheme can be used.

  Change in copy_siginfo_to_user: We only add one line to get the
  same copy semantics as for _SI_RT.

  This change isn't very portable - some archs have their own
  copy_siginfo_to_user.  All of those should get a similar change (but
  possibly not a one-liner, as the _SI_RT case was sometimes ignored because
  it wasn't used yet, e.g.  see ia64 signal.c).

Update:
mq: only fail with invalid timespec if mq_timed{send,receive} needs to block
From: Jakub Jelinek <jakub@redhat.com>

POSIX requires EINVAL to be set if:
"The process or thread would have blocked, and the abs_timeout parameter
specified a nanoseconds field value less than zero or greater than or equal
to 1000 million."
but 2.6.5-mm3 returns -EINVAL even if the process or thread would not block
(if the queue is not empty for timedreceive or not full for timedsend).
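
The intended ordering, where the timeout is only validated once the call is
known to need to block, can be sketched as below.  check_timeout() and its
would_block argument are hypothetical; this is not the kernel code.

```c
#include <errno.h>
#include <time.h>

/*
 * Return 0 if the call may proceed, or -EINVAL if it would have to block
 * and the nanoseconds field is outside [0, 1000 million).
 */
static int check_timeout(int would_block, const struct timespec *abs_timeout)
{
	if (!would_block)
		return 0;	/* queue can be served at once: timeout is not examined */
	if (abs_timeout->tv_nsec < 0 || abs_timeout->tv_nsec >= 1000000000L)
		return -EINVAL;
	return 0;
}
```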

be94d44e

[PATCH] posix message queues: syscall stubs · c50142a5

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

Add -ENOSYS stubs for the posix message queue syscalls.  The API is a direct
mapping of the api from the unix spec, with two exceptions:

- mq_close() doesn't exist.  Message queue file descriptors can be closed
  with close().

- mq_notify(SIGEV_THREAD) cannot be implemented in the kernel.  The kernel
  returns a pollable file descriptor.  User space must poll (or read) this
  descriptor and call the notifier function if the file descriptor is
  signaled.

c50142a5

[PATCH] posix message queues: code move · c334f752

Andrew Morton authored Apr 11, 2004

From: Manfred Spraul <manfred@colorfullife.com>

cleanup of sysv ipc as a preparation for posix message queues:

- replace !CONFIG_SYSVIPC wrappers for copy_semundo and exit_sem with
  static inline wrappers.  Now the whole ipc/util.c file is only used if
  CONFIG_SYSVIPC is set, use makefile magic instead of #ifdef.

- remove the prototypes for copy_semundo and exit_sem from kernel/fork.c -
  they belong in a header file.

- create a new msgutil.c with the helper functions for message queues.

- cleanup the helper functions: run Lindent, add __user tags.

c334f752

[PATCH] md: merge_bvec_fn needs to know about partitions. · 00d1b0e9

Andrew Morton authored Apr 11, 2004

From: Neil Brown <neilb@cse.unsw.edu.au>

Addresses http://bugme.osdl.org/show_bug.cgi?id=2355

It seems that a merge_bvec_fn needs to be aware of partitioning...  who
would have thought it :-(

The following patch should fix the merge_bvec_fn for both linear and raid0.
We teach linear and raid0 about partitions in the merge_bvec_fn.

->merge_bvec_fn needs to make decisions based on the physical geometry of the
device.  For raid0, it needs to decide if adding the bvec to the bio will
make the bio span two drives.

To do this, it needs to know where the request is (what the sector number is)
in the whole device.

However when called from bio_add_page, bi_sector is the sector number
relative to the current partition, as generic_make_request hasn't been called
yet.

So raid_mergeable_bvec needs to map bio->bi_sector (which is partition
relative) to a bi_sector which is device relative, so it can perform proper
calculations about when chunk boundaries are.
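
A minimal sketch of the mapping and the chunk-boundary calculation it
enables.  Both helper names are hypothetical, and the chunk size is assumed
to be a power of two; this is not the md code.

```c
#include <stdint.h>

/* Map a partition-relative sector to a whole-device sector. */
static uint64_t to_device_sector(uint64_t part_sector, uint64_t part_start)
{
	return part_sector + part_start;
}

/*
 * Sectors left before the next chunk boundary, counted from a
 * device-relative sector.  chunk_sectors must be a power of two.
 */
static uint64_t sectors_to_chunk_end(uint64_t dev_sector, uint64_t chunk_sectors)
{
	return chunk_sectors - (dev_sector & (chunk_sectors - 1));
}
```

For example, with 128-sector chunks and a partition starting at sector 63,
partition-relative sector 0 is device sector 63, so only 65 sectors remain
before the boundary.  Using the partition-relative sector directly would
give 128 and let the bio span two drives.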

00d1b0e9

[PATCH] knfsd: Add data integrity to server side gss · 9abdc660

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

From: "J. Bruce Fields" <bfields@fieldses.org>

rpcsec_gss supports three security levels:

1.  authentication only: sign the header of each rpc request and response.

2. integrity: sign the header and body of each rpc request and response.

3.  privacy: sign the header and encrypt the body of each rpc request and
   response.

The first 2 are already supported on the client; this adds integrity support
on the server.

9abdc660

[PATCH] knfsd: Export a symbol needed by auth_gss · 238a06e2

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

From: "J. Bruce Fields" <bfields@fieldses.org>

Without this compiling auth_gss as module fails.

238a06e2

[PATCH] knfsd: Improve UTF8 checking. · 1a260c78

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

From: Fred.  We don't do all the utf8 checking we could in the kernel, but we
do some simple checks.  Implement slightly stricter, and probably more
efficient, checking.

1a260c78

[PATCH] knfsd: Add server-side support for the nfsv4 mounted_on_fileid attribute. · c02c0886

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

c02c0886

[PATCH] knfsd: Remove name_lookup.h that no one is using anymore. · 94b1c3eb

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

94b1c3eb

[PATCH] knfsd: fix a problem with incorrectly formatted auth_error returns. · 5b2b9a81

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

From: Fred Isaman

5b2b9a81

[PATCH] knfsd: Minor fix to error return when updating server authentication information · d4658c74

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

d4658c74

[PATCH] knfsd: Return -EOPNOTSUPP when unknown mechanism name encountered · 4f9c4e9d

Andrew Morton authored Apr 11, 2004

From: NeilBrown <neilb@cse.unsw.edu.au>

It's better than oopsing.

4f9c4e9d

[PATCH] search for /init for initramfs boots · 8b770c1d

Andrew Morton authored Apr 11, 2004

From: Olaf Hering <olh@suse.de>

initramfs can not be used in current 2.6 kernels, the files will never be
executed because prepare_namespace doesn't care about them.  The only way to
work around that limitation is a root=0:0 cmdline option to force rootfs as
root filesystem.  This will break further booting because rootfs is not the
final root filesystem.

This patch checks for the presence of /init which comes from the cpio archive
(and that's the only way to store files into the rootfs).  This binary/script
has to do all the work of prepare_namespace().

8b770c1d

[PATCH] fs/inode.c list_head cleanup · 27d2e5e5

Andrew Morton authored Apr 11, 2004

Teach inode.c about list_move().

27d2e5e5
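
list_move() stands for unlinking an entry from one list and adding it at the
head of another, in place of an open-coded delete followed by an add.  The
sketch below is a user-space imitation of the kernel's circular
doubly-linked list helpers, written for illustration only; it is not the
kernel header.

```c
/* Circular doubly-linked list node, imitating the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Make head an empty list that points at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry from whatever list it is on. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Unlink entry from its current list and add it after head. */
static void list_move(struct list_head *entry, struct list_head *head)
{
	list_del(entry);
	list_add(entry, head);
}
```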

[PATCH] Quota locking fixes · ed678f13

Andrew Morton authored Apr 11, 2004

From: Jan Kara <jack@ucw.cz>

Change locking rules in quota code to fix lock ordering especially wrt
journal lock. Also some unnecessary spinlocking is removed. The locking
changes are mainly: dqptr_sem, dqio_sem are acquired only when transaction is
already started, dqonoff_sem before a transaction is started. This change
requires some callbacks to ext3 (also implemented in this patch) to start
transaction before the locks are acquired.

ed678f13

[PATCH] ppc44x: fix memory leak · a97de48b

Andrew Morton authored Apr 11, 2004

From: Matt Porter <mporter@kernel.crashing.org>

This fixes a memory leak when freeing pgds on PPC44x.

a97de48b

[PATCH] ppc64: UP compile fixes · 425c9687

Andrew Morton authored Apr 11, 2004

From: Anton Blanchard <anton@samba.org>

UP compile fixes

425c9687

[PATCH] ppc64: Quieten NVRAM driver · acfc20f7

Andrew Morton authored Apr 11, 2004

From: Anton Blanchard <anton@samba.org>

Quieten NVRAM driver

acfc20f7

[PATCH] ppc64: Remove unused rtas functions · ec19a28d

Andrew Morton authored Apr 11, 2004

From: Joel Schopp <jschopp@austin.ibm.com>

I was looking at rtas serialization for reasons I won't go into here.
While wandering through the code I found that two functions were not
properly serialized.  phys_call_rtas and phys_call_rtas_display_status are
the functions.  After looking further they are redundant and not
used anywhere at all.

ec19a28d

[PATCH] ppc64: DMA API updates · 9ed9e7e5

Andrew Morton authored Apr 11, 2004

From: Anton Blanchard <anton@samba.org>

DMA API updates, in particular adding the new cache flush interfaces.

9ed9e7e5

[PATCH] ppc64: Add smt_snooze_delay cpu sysfs attribute · b7ceb145

Andrew Morton authored Apr 11, 2004

From: Anton Blanchard <anton@samba.org>

Add smt_snooze_delay cpu sysfs attribute

b7ceb145