Commits · 99365bd4725d431255ff4bdd51fb3dca60c47322 · nexedi / linux

29 Dec, 2003 40 commits

[PATCH] synchronize use of mm->core_waiters · 99365bd4

Andrew Morton authored Dec 29, 2003

From: Roland McGrath <roland@redhat.com>

I believe I have identified a failure mode that Linus saw a couple weeks
back when tracking down some other fork/exit sorts of races.  We saw this
come up on rare occasions with the RHEL3 kernel's backport of the new code
(while trying to track down other race failure modes we have yet to fix, sigh).

I am talking about the following scenario:

> Btw, even with the fix, doing a "while : ; ./crash t 10 ; done" will
> eventually result in a stuck process:
>
> 	 1415 tty1     D      0:00 ./crash
>
> This is some kind of deadlock: most of the fifty threads are in "D"
> state, with a trace something like
>
> 	 [<c011fbe3>] schedule+0x360/0x7f8
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c0128c9e>] do_exit+0x627/0x6a4
> 	 [<c0128ddd>] do_group_exit+0x3d/0x177
> 	 [<c0130c13>] dequeue_signal+0x2d/0x84
> 	 [<c0133911>] get_signal_to_deliver+0x390/0x575
> 	 [<c010a541>] do_signal+0x6c/0xf1
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c013d50f>] do_futex+0x6d/0x7d
> 	 [<c013d635>] sys_futex+0x116/0x12f
> 	 [<c010a601>] do_notify_resume+0x3b/0x3d
> 	 [<c010a82e>] work_notifysig+0x13/0x15
>
> except for one that is trying to core-dump:
>
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c02101aa>] rwsem_wake+0x86/0x12d
> 	 [<c01738af>] coredump_wait+0xa8/0xaa
> 	 [<c0173a26>] do_coredump+0x175/0x26c
>
> and three that are just doing a regular "exit()" system call:
>
> 	 [<c011fbe3>] schedule+0x360/0x7f8
> 	 [<c011e19a>] recalc_task_prio+0x90/0x1aa
> 	 [<c0120539>] wait_for_completion+0xd4/0x1c3
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c01200be>] default_wake_function+0x0/0x12
> 	 [<c0210207>] rwsem_wake+0xe3/0x12d
> 	 [<c0128c9e>] do_exit+0x627/0x6a4
> 	 [<c0128d4d>] next_thread+0x0/0x53
> 	 [<c010a7e3>] syscall_call+0x7/0xb
>
> However, the rest of the system is totally unaffected by this deadlock:
> it's only deadlocked within the thread group itself, nobody else cares.

What happens here is a race between an exiting thread checking
mm->core_waiters in __exit_mm, and the thread taking the core-dump signal
(in coredump_wait) examining the first thread's ->mm pointer and
incrementing mm->core_waiters to account for it.  There is no
synchronization at all in __exit_mm's use of mm->core_waiters.  If the
coredump_wait thread reads tsk->mm when tsk is in __exit_mm between
checking mm->core_waiters and clearing tsk->mm, then it will increment
mm->core_waiters and the total count will later exceed the number of
threads that will ever decrement it and synchronize.  Hence it blocks forever.

The following patch fixes the problem by using mm->mmap_sem in __exit_mm.
The read lock must be held around checking mm->core_waiters and clearing
tsk->mm so that coredump_wait (which gets the write lock) cannot come in
between and do bogus bookkeeping.
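
The bookkeeping error can be replayed deterministically in user space. The following is only a sketch under simplifying assumptions: the structures, the enum and all function names are hypothetical stand-ins, not the kernel's code, and a single exiting thread is modelled. It shows that the count only goes wrong when the dumper runs between the check and the clear, which is exactly the window that holding mmap_sem closes.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct mm_sketch   { int core_waiters; };
struct task_sketch { struct mm_sketch *mm; };

enum interleaving { EXIT_THEN_DUMP, DUMP_THEN_EXIT, RACY };

/* Exit path, step 1: sample whether a core dump is waiting for us. */
static int exit_check(const struct task_sketch *t)
{
	return t->mm->core_waiters != 0;
}

/* Exit path, step 2: report in if step 1 saw a dump, then drop the mm. */
static void exit_detach(struct task_sketch *t, int saw_dump)
{
	if (saw_dump)
		t->mm->core_waiters--;
	t->mm = NULL;
}

/* Dumper: count a thread as a waiter if it is still attached to the mm. */
static void dumper_count(struct mm_sketch *mm, const struct task_sketch *t)
{
	if (t->mm == mm)
		mm->core_waiters++;
}

/* Returns the number of waiters the dumper would still be blocked on. */
int leftover_waiters(enum interleaving order)
{
	struct mm_sketch mm = { 0 };
	struct task_sketch t = { &mm };
	int saw = 0;

	switch (order) {
	case EXIT_THEN_DUMP:	/* serialized: the exit completes first */
		saw = exit_check(&t);
		exit_detach(&t, saw);
		dumper_count(&mm, &t);
		break;
	case DUMP_THEN_EXIT:	/* serialized: the dumper counts first */
		dumper_count(&mm, &t);
		saw = exit_check(&t);
		exit_detach(&t, saw);
		break;
	case RACY:		/* dumper slips in between check and clear */
		saw = exit_check(&t);
		dumper_count(&mm, &t);
		exit_detach(&t, saw);
		break;
	}
	return mm.core_waiters;
}
```

Both serialized orders leave no outstanding waiter; only the RACY order leaves one that nobody will ever decrement.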

99365bd4

[PATCH] DAC960 request queue per disk · dc942a21

Andrew Morton authored Dec 29, 2003

From: Dave Olien <dmo@osdl.org>

Here's a patch that changes the DAC960 driver from having one request
queue for ALL disks on the controller, to having a request queue for
each logical disk.  This turns out to make little difference for the
deadline scheduler, or for the AS scheduler under light IO load.  But under
the AS scheduler with heavy IO, it makes about a 40% difference on the dbt2
workload.  Here are the measured numbers:

The 2.6.0-test11-D kernel version includes this multi-queue patch to the
DAC960 driver.

For non-cached dbt2 workload  (heavy IO load)

Scheduler	kernel/driver	NOTPM(bigger is better)
AS		2.6.0-test11-D  1598
AS		2.6.0-test11     973
deadline	2.6.0-test11    1640
deadline	2.6.0-test11-D  1645

For cached dbt2 workload (lighter IO load)

AS		2.6.0-test11-D  4993
AS		2.6.-test6-mm4  4976, 4890, 4972
deadline	2.6.0-test11-D  4998

Can this be included in 2.6.0?  I know it's not a "critical patch"
in the sense that something won't work without it.  On the other hand,
the change is isolated to a driver.

dc942a21

[PATCH] fix userspace compiles with nbd.h · dd5a4db6

Andrew Morton authored Dec 29, 2003

From: Paul Clements <Paul.Clements@SteelEye.com>

A previous "cleanup" on the nbd.h header file broke userspace compiles.
I've added an #ifdef __KERNEL__ so that userspace doesn't need to worry
about the nbd_device structure, which is only used in-kernel. The patch
allows me to compile my nbd tools with the 2.6 nbd.h.

dd5a4db6

[PATCH] isdn_ppp_ccp.c uses uninitialized spinlock · fe8bbcd3

Andrew Morton authored Dec 29, 2003

From: Tonnerre Anklin <thunder@keepsake.ch>

This spinlock was used uninitialized. Gave me a lot of warnings.

fe8bbcd3

[PATCH] nr_slab accounting fix · d71abcaf

Andrew Morton authored Dec 29, 2003

From: Manfred Spraul <manfred@colorfullife.com>

If alloc_slabmgmt fails, then kmem_freepages() calls sub_page_state(),
although nr_slab was not yet increased.  The attached patch fixes that by
moving the inc_page_state into kmem_getpages().
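
The idea of keeping the increment next to the allocation can be sketched in user space. This is an illustration only: every name below is a hypothetical stand-in, malloc() replaces the page allocator, and the mgmt_fails flag simulates alloc_slabmgmt failing.

```c
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

long nr_slab;			/* stand-in for the nr_slab page-state counter */

/* Allocate backing pages; account for them only once they really exist. */
void *getpages_sketch(void)
{
	void *pages = malloc(SKETCH_PAGE_SIZE);

	if (pages)
		nr_slab++;
	return pages;
}

/* Free backing pages and undo the accounting done in getpages_sketch(). */
void freepages_sketch(void *pages)
{
	nr_slab--;
	free(pages);
}

/* Grow a cache by one slab; mgmt_fails simulates the management
 * allocation failing after the pages were obtained. */
void *grow_sketch(int mgmt_fails)
{
	void *pages = getpages_sketch();

	if (!pages)
		return NULL;
	if (mgmt_fails) {
		freepages_sketch(pages);	/* counter returns to its old value */
		return NULL;
	}
	return pages;
}
```

Because the increment and decrement are now paired inside the two page helpers, the failure path cannot drive the counter below its starting value.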

d71abcaf

[PATCH] More MODULE_ALIASes · 6788a95d

Andrew Morton authored Dec 29, 2003

From: Rusty Russell <rusty@rustcorp.com.au>
      Steve Youngs, Stephen Hemminger

Three more MODULE_ALIASes.  Trivial, but useful if people want things
to "just work" in 2.6.0.

6788a95d

[PATCH] struct_cpy compilation warning · e85132b2

Andrew Morton authored Dec 29, 2003

From: Ingo Molnar <mingo@elte.hu>

i've attached a minor fix for the 2.6.1 timeframe - we clearly meant
__struct_cpy_bug().  Newest versions of gcc warn about this.

e85132b2

[PATCH] slab reclaim accounting fix · 1cdf0eef

Andrew Morton authored Dec 29, 2003

From: Manfred Spraul <manfred@colorfullife.com>

slab_reclaim_pages is increased even if get_free_pages fails.  The attached
patch moves the update to the correct position.

1cdf0eef

[PATCH] fix outdated comment in jiffies.h · 162bc7d1

Andrew Morton authored Dec 29, 2003

From: Tim Schmielau <tim@physik3.uni-rostock.de>

162bc7d1

[PATCH] Allow unimap change on non fg console · a4b05bb1

Andrew Morton authored Dec 29, 2003

From: Kurt Garloff <garloff@suse.de>

The comment in front of vt_ioctl() reads
/*
 * We handle the console-specific ioctl's here.  We allow the
 * capability to modify any console, not just the fg_console.
 */

Unfortunately, this does not apply to PIO_UNIMAPCLR, nor
GIO_/PIO_UNIMAP. They always operate on the current foreground
console, which is inconsistent at least. For most ioctls, the
comment is applicable.

It also causes problems, as setfont can't do the full job on
the non-fg consoles. (OK, our setfont is slightly changed to
even try it ... as you know.)

The attached patch does fix this.

I have a similar patch for 2.4, but it never got merged :-(
because not many people seem to care and I submitted in the middle
of the 2.4 series ...
It has been in UnitedLinux/SUSE kernels for ages, though.

a4b05bb1

[PATCH] Clear dirty bits etc on compound frees · e86ff3c7

Andrew Morton authored Dec 29, 2003

From: "Martin J. Bligh" <mbligh@aracnet.com>,
      Guillaume Morin <guillaume@morinfr.org>

We need to clear the software dirty bit on the tail pages of a compound page
when freeing it up.

The tail pages can become dirtied by mmap'ing /dev/mem, and writing into
any clustered page group (that a driver might have created or whatever).

Plus it's better to run all these pages through the free_pages_check checks
anyway.

e86ff3c7

[PATCH] list_empty_careful() documentation. · 3182fe92

Andrew Morton authored Dec 29, 2003

From: Ingo Molnar <mingo@elte.hu>

I'd also suggest the following patch below, to clarify the use of
unsynchronized list_empty(). list_empty_careful() can only be safe in the
very specific case of "one-shot" list entries which might be removed by
another CPU. (but nothing else can happen to them and this is their only
final state.) list_empty_careful() is otherwise completely unsynchronized
on both the compiler and CPU level and is not 'SMP safe' in any way.
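
For readers without the source at hand, here is a self-contained user-space sketch of the check being documented. It is modelled on the kernel helper as I understand it (it compares both the next and the prev pointer against the head); the list type and the `_sketch` helpers are local stand-ins, and nothing here adds the barriers or locking that the text says are missing.

```c
struct list_head {
	struct list_head *next, *prev;
};

/* Make the head point at itself, i.e. an empty list. */
void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
void list_add_sketch(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry and leave it pointing at itself (the "one-shot" final state). */
void list_del_init_sketch(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry;
	entry->prev = entry;
}

/* Empty only if both links already point back at the head. */
int list_empty_careful_sketch(const struct list_head *head)
{
	const struct list_head *next = head->next;

	return next == head && next == head->prev;
}
```

Checking both pointers only helps against a removal that is half finished on another CPU; any other concurrent modification still needs real synchronization.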

3182fe92

[PATCH] MAINTAINERS vger.rutgers.edu · c13bb409

Andrew Morton authored Dec 29, 2003

From: Geert Uytterhoeven <geert@linux-m68k.org>

Mailing lists at vger.rutgers.edu are obsolete, use vger.kernel.org
instead.

c13bb409

[PATCH] more correct get_compat_timespec interface · 0eea2040

Andrew Morton authored Dec 29, 2003

From: Joe Korty <joe.korty@ccur.com>

The API for get_compat_timespec / put_compat_timespec is incorrect: it
forces a caller with const args to (incorrectly) cast.  The posix message
queue patch is one such caller.

0eea2040

[PATCH] dvb i2c timeout fix · 0f4e98bc

Andrew Morton authored Dec 29, 2003

From: Gerd Knorr <kraxel@bytesex.org>

Below is a ObviouslyCorrect[tm] patch which fixes the i2c bus timeout
handling in the saa7146 driver.

0f4e98bc

[PATCH] JBD: b_committed_data locking fix · 524e63d2

Andrew Morton authored Dec 29, 2003

The locking rules say that b_committed_data is covered by
jbd_lock_bh_state(), so implement that during the start of commit, while
throwing away unused shadow buffers.

I don't expect that there is really a race here, but them's the rules.

524e63d2

[PATCH] O_DIRECT memory leak fix · 7e3989bb

Andrew Morton authored Dec 29, 2003

From: Badari Pulavarty <pbadari@us.ibm.com>

I found the problem with O_DIRECT memory leak.

The problem is, when we are doing a DIO read and cross the end of file, we
don't release references on all the pages we got from get_user_pages()
(since it is a success case).

The fix is to call dio_cleanup() even for success cases.
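
A toy model of the leak, assuming a fixed batch of pinned pages and a read that stops early at end of file. All names and the reference-count array are hypothetical; the real dio code is considerably more involved.

```c
#define NPAGES 4		/* assumed size of one pinned batch */

int page_refs[NPAGES];		/* stand-in for per-page reference counts */

/* Pin the whole batch up front, as get_user_pages() would. */
static void pin_pages(void)
{
	for (int i = 0; i < NPAGES; i++)
		page_refs[i] = 1;
}

/* Drop references on pages [from, NPAGES) that were pinned but never used. */
static void cleanup_sketch(int from)
{
	for (int i = from; i < NPAGES; i++)
		page_refs[i]--;
}

/*
 * Read 'used' pages' worth of data, then hit end of file.
 * Returns the number of page references still held afterwards.
 */
int dio_read_sketch(int used, int cleanup_on_success)
{
	int held = 0;

	if (used > NPAGES)
		used = NPAGES;

	pin_pages();
	for (int i = 0; i < used; i++)
		page_refs[i]--;		/* consumed pages are released normally */

	if (cleanup_on_success)
		cleanup_sketch(used);	/* the fix: also run on success */

	for (int i = 0; i < NPAGES; i++)
		held += page_refs[i];
	return held;
}
```

With two of four pages consumed, skipping the cleanup leaves two references behind, while running it on the success path leaves none.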

7e3989bb

[PATCH] fix ELF exec with huge bss · 0363994f

Andrew Morton authored Dec 29, 2003

From: Roland McGrath <roland@redhat.com>

The following test program will crash every time if dynamically linked.
I think this bites all 32-bit platforms, including 32-bit executables on
64-bit platforms that support them (and could in theory bite 64-bit
platforms with bss sizes beyond the bounds of comprehension).

	#include <stdio.h>
	#include <stdlib.h>

	volatile char hugebss[1080000000];

	int main(void)
	{
		printf("%p..%p\n", (void *)&hugebss[0], (void *)&hugebss[sizeof hugebss]);
		system("cat /proc/$PPID/maps");
		hugebss[sizeof hugebss - 1] = 1;
		return 23;
	}

The problem is that the kernel maps ld.so at 0x40000000 or some such place,
before it maps the bss.  Here the bss is so large that it overlaps and
clobbers that mapping.  I've changed it to map the bss before it loads the
interpreter, so that part of the address space is reserved before ld.so's
mapping (which doesn't really care where it goes) is done.

This patch also adds error checking to the bss setup (and interpreter's bss
setup).  With the aforementioned change but no error checking, "ulimit -v
65536; ./hugebss" will crash in the store after the `system' call, because
the kernel will have failed to allocate the bss and ignored the error, so
the program runs without those pages being mapped at all.  With this change
it dies with a SIGKILL as for a failure to set up stack pages.  It might be
even better to try to detect the case earlier so that execve can return an
error before it has wiped out the address space.  But that seems like it
would always be fragile and miss some corner cases, so I did not try to add
such complexity.

0363994f

[PATCH] Erronous use of tick_usec in do_gettimeofday · 709087ca

Andrew Morton authored Dec 29, 2003

From: Joe Korty <joe.korty@ccur.com>

do_gettimeofday() is using tick_usec which is defined in terms of USER_HZ
not HZ.
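
The size of the error is easy to see with numbers. The sketch below assumes HZ=1000 and USER_HZ=100, which are illustrative values rather than anything stated in the patch, and the helper names are hypothetical; how a given architecture applies the tick length to missed ticks is likewise an assumption.

```c
#define SKETCH_HZ      1000UL	/* assumed kernel tick rate */
#define SKETCH_USER_HZ  100UL	/* assumed rate exported to userspace */

/* Microseconds per tick for a given tick rate, rounded to nearest. */
unsigned long usec_per_tick(unsigned long hz)
{
	return (1000000UL + hz / 2) / hz;
}

/* Microseconds covered by 'lost' missed ticks at the given rate. */
unsigned long lost_usec(unsigned long lost, unsigned long hz)
{
	return lost * usec_per_tick(hz);
}
```

Under these assumptions a USER_HZ-based tick is ten times longer than a real one, so any correction for missed ticks would be off by the same factor.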

709087ca

[PATCH] md: set ra_pages for raid0/raid5 devices properly. · c5b971d7

Andrew Morton authored Dec 29, 2003

From: NeilBrown <neilb@cse.unsw.edu.au>

stripe to be effective.  This patch sets ra_pages
appropriately.

c5b971d7

[PATCH] md: Limit max_sectors on md when merge_bvec_fn defined on underlying device. · 59165b4f

Andrew Morton authored Dec 29, 2003

From: NeilBrown <neilb@cse.unsw.edu.au>

As no md personalities honour the merge_bvec_fn of underlying devices,
we must make sure never to submit a bio larger than 1 page when a 
merge_bvec_fn is defined.

raid5 already does this (it never submits bios larger than one page).
With this patch, all other raid personalities limit their
max_sectors when a merge_bvec_fn is present.
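
Roughly, the clamp amounts to the following. This is a sketch with a hypothetical queue structure and an assumed 4096-byte page; the flag stands in for "the underlying device defines a merge_bvec_fn".

```c
#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SECTOR_SHIFT     9	/* 512-byte sectors */

/* Hypothetical stand-in for the relevant request queue fields. */
struct queue_sketch {
	unsigned int max_sectors;
	int has_merge_bvec_fn;
};

/* Clamp the md queue to one page if this member device is picky about merges. */
void limit_for_member(struct queue_sketch *md, const struct queue_sketch *member)
{
	unsigned int one_page = SKETCH_PAGE_SIZE >> SECTOR_SHIFT;

	if (member->has_merge_bvec_fn && md->max_sectors > one_page)
		md->max_sectors = one_page;
}
```

A one-page limit means no bio can span more than one page, so there is never a merge decision that the underlying merge_bvec_fn would have needed to veto.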

59165b4f

[PATCH] BINFMT_ELF=m is not an option · 0b0a866d

Andrew Morton authored Dec 29, 2003

From: glee@gnupilgrims.org

I think Adrian had forgotten to update the help text.

0b0a866d

[PATCH] Ext3+quota deadlock fix · db84a820

Andrew Morton authored Dec 29, 2003

From: Jan Kara <jack@ucw.cz>

Here's a patch which should fix the deadlock with quotas+ext3 reported in 2.4
(the same problem existed in 2.6, but nobody found it).

db84a820

[PATCH] Fix possible oops in vfs_quota_sync() · b0d8c562

Andrew Morton authored Dec 29, 2003

From: Jan Kara <jack@ucw.cz>

I'm sending you a fix for a possible Oops in vfs_quota_sync().  Actually
nobody has run into it; I found it when I was looking through the code.

b0d8c562

[PATCH] sis comparison / assignment operator fix · 155717ab

Andrew Morton authored Dec 29, 2003

From: Geoffrey Lee <glee@gnupilgrims.org>

This fixes what seems to be an obvious = vs == bug in the init301.c sis
file.

155717ab

[PATCH] remove mm->swap_address · 695716f5

Andrew Morton authored Dec 29, 2003

From: William Lee Irwin III <wli@holomorphy.com>

This field is 100% unused. This patch removes it.

695716f5

[PATCH] Fix 32bit siginfo problems on x86-64 · 3b35cbe5

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

32bit siginfo would sometimes get passed incorrectly on x86-64. This
change fixes the conversion function to be a bit dumber, but more
correct.

3b35cbe5

[PATCH] Don't panic in mpparse on x86-64 · 53b3aa6c

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Merge i386 fix. Don't panic in MP table parsing when the table is bad.

53b3aa6c

[PATCH] Signal fixes for x86-64 · ca981c9f

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Merge signal race fixes from i386 to x86-64.

Fix a bug in system call restart, noted by John Blackwood.

ca981c9f

[PATCH] Merge i386 fix for page fault to x86-64 · 2988d8dd

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Merge the i386 fix for the page fault from Linus to x86-64
(I'm not actually sure what it fixes, but if it's good for 32bit
it is likely good for 64bit too)

2988d8dd

[PATCH] Add more paranoid checking in x86-64 prefetch checker · cf79a124

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Make sure we never access anything in kernel mapping while
doing the prefetch workaround checks on x86-64.

Originally suggested by Jamie Lokier.

cf79a124

[PATCH] Fix 32bit truncate on x86-64 · 8f0f4aaa

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Another potential data corruption fix.

The 32bit truncate64 on x86-64 did silently truncate
offsets >32bit. That broke mysql for example. Fix that.

From Chris Wilson
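
For illustration, assuming the 32-bit caller hands the 64-bit offset over as two 32-bit halves, the difference between the broken and the fixed behaviour looks like this. The function names are hypothetical and this is not the actual syscall wrapper.

```c
#include <stdint.h>

/* Correct: rebuild the full 64-bit offset from its two halves. */
uint64_t join_offset(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}

/* The failure mode described above: the upper half is silently dropped. */
uint64_t truncated_offset(uint32_t low, uint32_t high)
{
	(void)high;
	return low;
}
```

A request to truncate at exactly 4 GiB (low = 0, high = 1) becomes a truncate to zero bytes in the broken variant, which is how data gets lost.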

8f0f4aaa

[PATCH] Fix sysrq-t on x86-64 · 3959fde8

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

From Badari Pulavarty

Without this, sysrq-t shows the same backtrace for all processes on x86-64.

3959fde8

[PATCH] Fix CPUID compilation on x86-64 · 2393a309

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

A lot of people have run into this: the x86-64 cpuid driver didn't
compile as a module.

Using a kludge suggested by Sam Ravnborg.

2393a309

[PATCH] Critical x86-64 IOMMU fixes for 2.6.0 · f2059100

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Please consider applying this patch, I would consider it critical for x86-64.

The 2.6.0 x86-64 IOMMU code unfortunately had a few problems, leading
to non booting systems and in a few cases to data corruption.

It fixes two serious bugs in handling special kinds of scatter gather
lists in pci_map_sg.

AGP was completely broken with IOMMU because of a wrong #ifdef.
Fix that.

One TLB flush optimization I did a long time ago seems to break on
some 3ware boards (which require the IOMMU because they don't support 64bit
addresses).  The breakage led to data corruption. This patch disables
the optimization for now and fixes a potential SMP race in the flush
code too. The TLB flush is now done in a slower, but more reliable way.

This patch fixes them. Please consider applying, because some of these
problems hit quite a lot of people.

This also disables the IOMMU_DEBUG in the defconfig. A lot of people 
were using the IOMMU when they didn't need to, which multiplied the
problems.

IOMMU merge is disabled for now. This was an experimental optimization
which helped with some block devices, but for production it seems to
be better to disable it for now because there are some questionable
corner cases when the IOMMU aperture fragments. The same is done
for IOMMU SAC force, which was related to that. 

i386 has quite broken semantics for pci_alloc_consistent(). It uses
the standard device DMA mask instead of the consistent mask. Make us
bug-to-bug compatible here. This fixes problems with some sound
drivers that don't support full 32bit addressing.

f2059100

[PATCH] Add a.out support for x86-64 · b14a4258

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

Add 32bit a.out support for x86-64.

Not exactly an important bug fix, but maybe it will help someone.  This
should increase the current 98% compatibility to i386 to perhaps 98.1% @)

I tested an old a.out SuSE 4.2 installation in chroot and it worked.  It
also ran some very old linux binaries from '92 found on ftp.funet.fi.  The
only program that didn't was the SuSE a.out GNU emacs, but I was too lazy
to track that down.  Core dumps are not supported.

b14a4258

[PATCH] statfs64 fix · dce80777

Andrew Morton authored Dec 29, 2003

From: Andi Kleen <ak@muc.de>

It fixes the statfs64 emulation on x86-64.  The problem is that x86-64
needs an __attribute__((aligned)) on the compat_statfs64 structure.  The
conclusion last time this was discussed was that the structure should be
duplicated.

Essentially it is the old shared structure copied to every user and x86-64
uses __attribute__((packed)).
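
Why the attribute matters can be shown with a small layout experiment. The field names below are hypothetical and this is not the real compat_statfs64 definition; the point is only that a structure containing 64-bit members can have a different size under the i386 and x86-64 ABIs unless it is packed. The attribute syntax is the GCC one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout using the compiler's natural alignment rules. */
struct natural_sketch {
	uint32_t type;
	uint32_t bsize;
	uint64_t blocks;
	uint32_t spare;
};

/* Same fields, but with padding suppressed. */
struct packed_sketch {
	uint32_t type;
	uint32_t bsize;
	uint64_t blocks;
	uint32_t spare;
} __attribute__((packed));

/* Size of the packed layout: the plain sum of the field sizes. */
size_t packed_size(void)
{
	return sizeof(struct packed_sketch);
}

/* Size of the natural layout: depends on how the ABI aligns 64-bit fields. */
size_t natural_size(void)
{
	return sizeof(struct natural_sketch);
}
```

The packed size is the same everywhere, whereas the natural size may include tail padding on ABIs that align 64-bit fields to 8 bytes.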

dce80777

[PATCH] dm and bounce buffer panic fix · 85734c47

Andrew Morton authored Dec 29, 2003

From: Mark Haverkamp <markh@osdl.org>

About three weeks ago markw at osdl posted a mail about a panic that he
was seeing:

http://marc.theaimsgroup.com/?l=linux-kernel&m=106737176716474&w=2

I believe what is happening, is that the dm __clone_and_map function is
generating bio structures with the bi_idx field non-zero.  When
__blk_queue_bounce creates a new bio with bounce pages, it sets the bi_idx
field to 0 rather than the bi_idx of the original.  This causes trouble since
bv_page pointers will be dereferenced later that are zero.  The following
uses the original bio structure's bi_idx in the new bio structure and in
copy_to_high_bio_irq and bounce_end_io.

This has cleared up the panic when using the volume.

(acked by Joe Thornber)
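
The effect of resetting the index can be modelled with a tiny array-based stand-in for a bio. The structure and function names are hypothetical, and only the index handling is modelled, not the bounce copy itself.

```c
#include <stddef.h>

#define MAX_VECS 4

/* Hypothetical, heavily reduced stand-in for a bio and its vector. */
struct bio_sketch {
	const char *pages[MAX_VECS];	/* NULL means "no page here" */
	unsigned short idx;		/* first entry that is still valid */
	unsigned short cnt;		/* one past the last valid entry */
};

/* Clone for bouncing; keep_idx selects the fixed behaviour. */
struct bio_sketch bounce_clone(const struct bio_sketch *orig, int keep_idx)
{
	struct bio_sketch clone = *orig;

	clone.idx = keep_idx ? orig->idx : 0;
	return clone;
}

/* Count the NULL page pointers a walk from idx to cnt would dereference. */
int null_pages_walked(const struct bio_sketch *bio)
{
	int nulls = 0;

	for (unsigned short i = bio->idx; i < bio->cnt; i++)
		if (bio->pages[i] == NULL)
			nulls++;
	return nulls;
}
```

With an original whose first two entries are unused and idx = 2, a clone that restarts at 0 walks two NULL pages, while a clone that keeps the original idx walks none.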

85734c47

[PATCH] ext3: bd_claim for journal device · 9907e736

Andrew Morton authored Dec 29, 2003

From: Neil Brown <neilb@cse.unsw.edu.au>

Change ext3 to run bd_claim() against external journal devices. This matters
only for those who have ext3 journals on a separate device; ext3 then
gets exclusive access to that device.

9907e736

[PATCH] remove include recursion from linux/pagemap.h · 1fcec52f

Andrew Morton authored Dec 29, 2003

From: Arnaldo Carvalho de Melo <acme@conectiva.com.br>

pagemap.h, do not include thyself.

1fcec52f