Commits · d3069b4dd0767b4e24debdb21b632b2f8dd72474 · Kirill Smelkov / linux

18 Oct, 2004 3 commits

[PATCH] fix & clean up zombie/dead task handling & preemption · d3069b4d

Ingo Molnar authored 20 years ago

This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we
had.  Right now, the moment procfs does a down() that sleeps in
proc_pid_flush() [it could], our TASK_DEAD state is zapped and we might be
back to TASK_RUNNING, and we trigger this assert:

        schedule();
        BUG();
        /* Avoid "noreturn function does return".  */
        for (;;) ;

I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state
field, to allow the detaching of exit-signal/parent/wait-handling from
descheduling a dead task.  Dead-task freeing is done via PF_DEAD.

Tested the patch on x86 SMP and UP, but all architectures should work
fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

d3069b4d

[PATCH] exec: fix posix-timers leak and pending signal loss · fef60c1b

Roland McGrath authored 20 years ago

I've found some problems with exec and fixed them with this patch to
de_thread.

The second problem is that a multithreaded exec loses all pending signals.
This is a violation of POSIX rules. But a moment's thought will show it's
also just not desirable: if you send a process a SIGTERM while it's in the
middle of calling exec, you expect either the original program in that
process or the new program being exec'd to handle that signal or be killed
by it. As it stands now, you can try to kill a process and have that
signal just evaporate if it's multithreaded and calls exec just then. I
really don't know what the rationale was behind the de_thread code that
allocates a new signal_struct. It doesn't make any sense now. The other
code there ensures that the old signal_struct is no longer shared. Except
for posix-timers, all the state there is stuff you want to keep. So my
changes just keep the old structs when they are no longer shared, and all
the right state is retained (after clearing out posix-timers).

The final bug is that the cumulative statistics of dead threads and dead
child processes are lost in the abandoned signal_struct. This is also
fixed by holding on to it instead of replacing it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

fef60c1b

[PATCH] make rlimit settings per-process instead of per-thread · 31180071

Roland McGrath authored 20 years ago

POSIX specifies that the limit settings provided by getrlimit/setrlimit are
shared by the whole process, not specific to individual threads. This
patch changes the behavior of those calls to comply with POSIX.

I've moved the struct rlimit array from task_struct to signal_struct, as it
has the correct sharing properties. (This reduces kernel memory usage per
thread in multithreaded processes by around 100/200 bytes for 32/64-bit
machines respectively.) I took a fairly minimal approach to the locking
issues with the newly shared struct rlimit array. It turns out that all
the code that is checking limits really just needs to look at one word at a
time (one rlim_cur field, usually). It's only the few places like
getrlimit itself (and fork), that require atomicity in accessing a whole
struct rlimit, so I just used a spin lock for them and no locking for most
of the checks. If it turns out that readers of struct rlimit need more
atomicity where they are now cheap, or less overhead where they are now
atomic (e.g. fork), then seqcount is certainly the right thing to use for
them instead of readers using the spin lock. Though it's in signal_struct,
I didn't use siglock since the access to rlimits never needs to disable
irqs and doesn't overlap with other siglock uses. Instead of adding
something new, I overloaded task_lock(task->group_leader) for this; it is
used for other things that are not likely to happen simultaneously with
limit tweaking. To me that seems preferable to adding a word, but it would
be trivial (and arguably cleaner) to add a separate lock for these users
(or e.g. just use seqlock, which adds two words but is optimal for readers).
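
The locking split described above can be illustrated with a userspace analogy. This is only a sketch under stated assumptions, not the kernel code: a pthread mutex stands in for task_lock(task->group_leader), C11 atomics stand in for the kernel's plain word-sized reads, and every name ending in _sketch (plus the three helper functions) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one struct rlimit entry. */
struct rlimit_sketch {
    atomic_ulong cur;   /* soft limit */
    atomic_ulong max;   /* hard limit */
};

/* Hypothetical stand-in for the shared, per-process limit storage. */
struct limits_sketch {
    pthread_mutex_t lock;          /* plays the role of the group leader's task lock */
    struct rlimit_sketch rlim;
};

/* Cheap path: a limit check reads a single word, so it takes no lock. */
static int over_soft_limit(struct limits_sketch *l, unsigned long usage)
{
    assert(l != NULL);
    return usage > atomic_load_explicit(&l->rlim.cur, memory_order_relaxed);
}

/* Atomic path: a reader that needs a consistent (cur, max) pair takes the lock. */
static void get_limit(struct limits_sketch *l, unsigned long *cur, unsigned long *max)
{
    pthread_mutex_lock(&l->lock);
    *cur = atomic_load_explicit(&l->rlim.cur, memory_order_relaxed);
    *max = atomic_load_explicit(&l->rlim.max, memory_order_relaxed);
    pthread_mutex_unlock(&l->lock);
}

/* Writers always update both words under the same lock. */
static void set_limit(struct limits_sketch *l, unsigned long cur, unsigned long max)
{
    pthread_mutex_lock(&l->lock);
    atomic_store_explicit(&l->rlim.cur, cur, memory_order_relaxed);
    atomic_store_explicit(&l->rlim.max, max, memory_order_relaxed);
    pthread_mutex_unlock(&l->lock);
}
```

In this arrangement only get_limit() and set_limit() pay for the lock; the common check in over_soft_limit() stays a single load, which is the trade-off the message argues for.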

Most of the changes here are just the trivial s/->rlim/->signal->rlim/.

I stumbled across what must be a long-standing bug, in reparent_to_init.
It does:
	memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim)));
when surely it was intended to be:
	memcpy(current->rlim, init_task.rlim, sizeof(current->rlim));
As rlim is an array, the * in the sizeof expression gets the size of the
first element, so this just changes the first limit (RLIMIT_CPU). This is
for kernel threads, where it's clear that resetting all the rlimits is what
you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is
superfluous since it will now already have been reset to RLIM_INFINITY.
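
The sizeof difference is easy to reproduce outside the kernel. The following is a minimal sketch with hypothetical types (rlimit_sketch, task_sketch) and a made-up array length; it shows only that sizeof(*array) is the size of one element while sizeof(array) covers the whole array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NLIMITS_SKETCH 4   /* hypothetical; not the kernel's RLIM_NLIMITS */

struct rlimit_sketch { unsigned long cur, max; };

struct task_sketch { struct rlimit_sketch rlim[NLIMITS_SKETCH]; };

/* Buggy form: sizeof(*array) is the size of the first element only. */
static size_t copy_first_only(struct task_sketch *dst, const struct task_sketch *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst->rlim, src->rlim, sizeof(*(dst->rlim)));
    return sizeof(*(dst->rlim));
}

/* Intended form: sizeof(array) is the size of every element together. */
static size_t copy_all(struct task_sketch *dst, const struct task_sketch *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst->rlim, src->rlim, sizeof(dst->rlim));
    return sizeof(dst->rlim);
}
```

Both functions return the number of bytes they copied, so a caller can see that the first form touches a single struct rlimit_sketch and leaves the remaining entries unchanged.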

The other subtlety is removing:
	tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY;
in exit_notify, which was to avoid a race signalling during self-reaping
exit. As the limit is now shared, a dying thread should not change it for
others. Instead, I avoid that race by checking current->state before the
RLIMIT_CPU check. (Adding one new conditional in that path is now required
one way or another, since if not for this check there would also be a new
race with self-reaping exit later on clearing current->signal that would
have to be checked for.)

The one loose end left by this patch is with process accounting.
do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the
accounting record. I left this as it was, but it is now changing a limit
that might be shared by other threads still running. I left this in a
dubious state because it seems to me that process accounting may already
be more generally in a dubious state when it comes to NPTL threads. I would
think you would want one record per process, with aggregate data about all
threads that ever lived in it, not a separate record for each thread.
I don't use process accounting myself, but if anyone is interested in
testing it out I could provide a patch to change it this way.

One final note: this does not reach 100% POSIX compliance with regard to rlimits.
POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not
to each individual thread. I will provide patches later on to achieve that
change, assuming this patch goes in first.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

31180071

17 Sep, 2004 1 commit

[PATCH] fix posix-timers leak · c68f9a4d

Roland McGrath authored 20 years ago

Exec fails to clean up posix-timers.  This manifests itself in two ways, one
worse than the other.  In the single-threaded case, it just fails to clear out
the timers on exec.  POSIX says that exec clears out the timers from
timer_create (though not the setitimer ones), so it's wrong that a lingering
timer could fire after exec and kill the process with a signal it's not
expecting.  In the multi-threaded case, it not only leaves lingering timers,
but it leaks them entirely when it replaces signal_struct, so they will never
be freed by the process exiting after that exec.  The new per-user
RLIMIT_SIGPENDING actually limits the damage here, because a UID will fill up
its quota with leaked timers and then never be able to use timer_create again
(that's what my test program does).  But if you have many many untrusted UIDs,
this leak could be considered a DoS risk.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

c68f9a4d

15 Sep, 2004 1 commit

[PATCH] reduce [compat_]do_execve stack usage · 7d31ee9b

Denis Vlasenko authored 20 years ago

Allocating the 'struct linux_binprm' on the stack chews up too much
stack space.

Just kmalloc it instead.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

7d31ee9b

27 Aug, 2004 1 commit

[PATCH] O(1) proc_pid_statm() · 6ac0a8d7

William Lee Irwin III authored 20 years ago

Merely removing down_read(&mm->mmap_sem) from task_vsize() is too
half-assed to let stand. The following patch removes the vma iteration
as well as the down_read(&mm->mmap_sem) from both task_mem() and
task_statm() and callers for the CONFIG_MMU=y case in favor of
accounting the various stats reported at the times of vma creation,
destruction, and modification. Unlike the 2.4.x patches of the same
name, this has no per-pte-modification overhead whatsoever.

This patch quashes end user complaints of top(1) being slow as well as
kernel hacker complaints of per-pte accounting overhead simultaneously.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

6ac0a8d7

24 Aug, 2004 3 commits

[PATCH] Reduce bkl usage in do_coredump · 77dc05e7

Josh Aas authored 20 years ago

A patch that reduces bkl usage in do_coredump.  I don't see anywhere that
it is necessary except for the call to format_corename, which is controlled
via sysctl (sys_sysctl holds the bkl).

Also make format_corename() static.
Signed-off-by: Josh Aas <josha@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

77dc05e7

[PATCH] i386 virtual memory layout rework · 8913d55b

Ingo Molnar authored 20 years ago

  Rework the i386 mm layout to allow applications to allocate more virtual
  memory, and larger contiguous chunks.


  - the patch is compatible with existing architectures that either make
    use of HAVE_ARCH_UNMAPPED_AREA or use the default mmap() allocator - there
    is no change in behavior.

  - 64-bit architectures can use the same mechanism to clean up 32-bit
    compatibility layouts: by defining HAVE_ARCH_PICK_MMAP_LAYOUT and
    providing an arch_pick_mmap_layout() function - which can then decide
    between various mmap() layout functions.

  - I also introduced a new personality bit (ADDR_COMPAT_LAYOUT) to signal
    older binaries that don't have PT_GNU_STACK.  x86 uses this to revert back
    to the stock layout.  I also changed x86 to not clear the personality bits
    upon exec(), like x86-64 already does.

  - once every architecture that uses HAVE_ARCH_UNMAPPED_AREA has defined
    its arch_pick_mmap_layout() function, we can get rid of
    HAVE_ARCH_UNMAPPED_AREA altogether, as a final cleanup.

  the new layout generation function (__get_unmapped_area()) got significant
  testing in FC1/2, so i'm pretty confident it's robust.


  Compiles & boots fine on an 'old' and on a 'new' x86 distro as well.

  The two known breakages were:

     http://www.redhatconfig.com/msg/67248.html

     [ 'cyzload' third-party utility broke. ]

     http://www.zipworld.com/au/~akpm/dde.tar.gz

     [ your editor broke :-) ]

  both were caused by application bugs that did:

	int ret = malloc();

	if (ret <= 0)
		failure;

  such bugs are easy to spot if they happen, and if one does it's possible
  to work around it immediately without having to change the binary, via the
  setarch patch.
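
A plausible reading of why this pattern breaks (an assumption, since the message does not spell it out) is that the reworked layout can hand out addresses in the upper half of the 32-bit address space, which turn negative once stored in a signed int. The sketch below models a 32-bit address with uint32_t; the function names and sample addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy check: the address is squeezed into a signed 32-bit int first.
 * Converting an out-of-range value to a signed type is implementation-defined
 * in C; on the usual two's-complement compilers the top bit becomes the sign. */
static int buggy_looks_failed(uint32_t addr)
{
    int32_t ret = (int32_t)addr;
    return ret <= 0;
}

/* Correct check: only the null address signals failure. */
static int correct_looks_failed(uint32_t addr)
{
    return addr == 0;
}
```

With a low address such as 0x08100000 both checks agree; with a high one such as 0xb7f00000 the buggy check reports a failure that never happened.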

  No other application has been found to be affected, and this particular
  change has already had pretty wide coverage through RHEL3 and exec-shield;
  it has been in use for more than a year.


  The setarch utility can be used to trigger the compatibility layout on
  x86, the following version has been patched to take the `-L' option:

 	http://people.redhat.com/mingo/flexible-mmap/setarch-1.4-2.tar.gz

  "setarch -L i386 <command>" will run the command with the old layout.

From: Hugh Dickins <hugh@veritas.com>

  The problem is in the flexible mmap patch: arch_get_unmapped_area_topdown
  is liable to give your mmap vm_start above TASK_SIZE with vm_end wrapped;
  which is confusing, and ends up as that BUG_ON(mm->map_count).

  The patch below stops that behaviour, but it's not the full solution:
  wilson_mmap_test -s 1000 then simply cannot allocate memory for the large
  mmap, whereas it works fine non-top-down.

  I think it's wrong to interpret a large or rlim_infinite stack rlimit as
  an inviolable request to reserve that much for the stack: it makes much less
  VM available than bottom up, not what was intended.  Perhaps top down should
  go bottom up (instead of belly up) when it fails - but I'd probably better
  leave that to Ingo.

  Or perhaps the default should place stack below text (as WLI suggested and
  ELF intended, with its text defaulting to 0x08048000, small progs sharing
  page table between stack and text and data); with a further personality for
  those needing bigger stack.

From: Ingo Molnar <mingo@elte.hu>

  - fall back to the bottom-up layout if the stack can grow unlimited (if
  the stack ulimit has been set to RLIM_INFINITY)

  - try the bottom-up allocator if the top-down allocator fails - this can
  utilize the hole between the true bottom of the stack and its ulimit, as a
  last-resort effort.
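
The two fallback rules can be sketched as plain functions. This is an illustration only, not the kernel's implementation: the enum, the function-pointer type, and the two stub allocators are hypothetical, and a return value of 0 is assumed to mean "allocation failed".

```c
#include <assert.h>
#include <sys/resource.h>   /* rlim_t, RLIM_INFINITY */

enum layout_sketch { LAYOUT_BOTTOM_UP, LAYOUT_TOP_DOWN };

/* Rule 1: an unlimited stack rlimit selects the bottom-up layout. */
static enum layout_sketch pick_layout(rlim_t stack_limit)
{
    return stack_limit == RLIM_INFINITY ? LAYOUT_BOTTOM_UP : LAYOUT_TOP_DOWN;
}

/* Allocators in this sketch return 0 on failure. */
typedef unsigned long (*alloc_fn_sketch)(unsigned long len);

/* Rule 2: if the top-down allocator fails, retry with the bottom-up one. */
static unsigned long alloc_with_fallback(alloc_fn_sketch top_down,
                                         alloc_fn_sketch bottom_up,
                                         unsigned long len)
{
    unsigned long addr = top_down(len);
    return addr ? addr : bottom_up(len);
}

/* Hypothetical stand-ins for real allocators, used only for demonstration. */
static unsigned long stub_always_fails(unsigned long len) { (void)len; return 0; }
static unsigned long stub_fixed_address(unsigned long len) { (void)len; return 0x40000000UL; }
```

Passing stub_always_fails as the top-down allocator and stub_fixed_address as the bottom-up one exercises the last-resort path.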
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

8913d55b

[PATCH] sched: cleanup, improve sched <=> fork APIs · 3632d86a

Nick Piggin authored 20 years ago

Move balancing and child-runs-first logic from fork.c into sched.c where
it belongs.

* Consolidate wake_up_forked_process and wake_up_forked_thread into
  wake_up_new_process, and pass in clone_flags as suggested by Linus.  This
  removes a lot of code duplication and allows all logic to be handled in that
  function.

* Don't do balance-on-clone balancing for vfork'ed threads.

* Don't do set_task_cpu or balance-on-clone in wake_up_new_process.
  Instead do it in sched_fork to fix set_cpus_allowed races.

* Don't do child-runs-first for CLONE_VM processes, as there is obviously no
  COW benefit to be had.  This is a big one, it enables Andi's workload to run
  well without clone balancing, because the OpenMP child threads can get
  balanced off to other nodes *before* they start running and allocating
  memory.

* Rename sched_balance_exec to sched_exec: hide the policy from the API.
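
The CLONE_VM rule in the list above reduces to a one-line predicate on the clone flags. The sketch below is hypothetical (the function name is made up, and the constant is defined locally to mirror the value of Linux's CLONE_VM); it is not the code in sched.c.

```c
#include <assert.h>

#define CLONE_VM_SKETCH 0x00000100UL  /* local copy mirroring Linux's CLONE_VM */

/* A child that shares the parent's address space has no copy-on-write
 * pages to save, so running it first brings no COW benefit. */
static int child_runs_first(unsigned long clone_flags)
{
    return (clone_flags & CLONE_VM_SKETCH) == 0;
}
```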


From: Ingo Molnar <mingo@elte.hu>

  rename wake_up_new_process -> wake_up_new_task.

  in sched.c we are gradually moving away from the overloaded 'process' or
  'thread' notion to the traditional task (or context) naming.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

3632d86a

23 Aug, 2004 1 commit

[PATCH] proc fs task name locking fix · 4b4b699d

Mike Kravetz authored 20 years ago

Races have been observed between exec-time overwriting of task->comm and
/proc accesses to the same data.  This causes environment string
information to appear in /proc.

Fix that up by taking task_lock() around updates to and accesses to
task->comm.
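
The same discipline can be sketched in userspace with a mutex guarding a fixed-size name buffer. This is an analogy under stated assumptions, not the kernel patch: pthread_mutex_t stands in for task_lock(), and task_sketch, set_comm() and get_comm() are hypothetical names for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define COMM_LEN_SKETCH 16   /* hypothetical buffer size for this sketch */

struct task_sketch {
    pthread_mutex_t lock;            /* plays the role of task_lock() */
    char comm[COMM_LEN_SKETCH];
};

/* Writer: replace the name while holding the lock, always NUL-terminated. */
static void set_comm(struct task_sketch *t, const char *name)
{
    pthread_mutex_lock(&t->lock);
    strncpy(t->comm, name, sizeof(t->comm) - 1);
    t->comm[sizeof(t->comm) - 1] = '\0';
    pthread_mutex_unlock(&t->lock);
}

/* Reader: copy the name out under the same lock so a half-written
 * buffer is never observed. */
static void get_comm(struct task_sketch *t, char *buf, size_t len)
{
    assert(len > 0);
    pthread_mutex_lock(&t->lock);
    strncpy(buf, t->comm, len - 1);
    buf[len - 1] = '\0';
    pthread_mutex_unlock(&t->lock);
}
```

Readers copy into their own buffer rather than holding a pointer to comm, so nothing is read after the lock is dropped.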
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

4b4b699d

18 Jul, 2004 1 commit

[PATCH] NX: clean up legacy binary support · 1bb0fa18

Ingo Molnar authored 20 years ago

This cleans up legacy x86 binary support by introducing a new
personality bit: READ_IMPLIES_EXEC, and implements Linus' suggestion to
add the PROT_EXEC bit on the two affected syscall entry places,
sys_mprotect() and sys_mmap().  If this bit is set then PROT_READ will
also add the PROT_EXEC bit - as expected by legacy x86 binaries.  The
ELF loader will automatically set this bit when it encounters a legacy
binary.
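
The rule "PROT_READ also adds PROT_EXEC when the bit is set" can be written as a small pure function. This is a sketch, not the code added to sys_mmap() or sys_mprotect(): effective_prot() is a hypothetical name, and the personality constant is defined locally to mirror the Linux value.

```c
#include <assert.h>
#include <sys/mman.h>   /* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

#define READ_IMPLIES_EXEC_SKETCH 0x0400000UL  /* local copy of the personality bit */

/* Return the protection actually applied for a requested prot value. */
static int effective_prot(unsigned long personality, int prot)
{
    assert(prot >= 0);
    if ((personality & READ_IMPLIES_EXEC_SKETCH) && (prot & PROT_READ))
        prot |= PROT_EXEC;
    return prot;
}
```

Because the extra bit is added only when PROT_READ is requested, a PROT_NONE mapping stays inaccessible even for a legacy binary.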

This approach avoids the problems the previous ->def_flags solution
caused.  In particular this patch fixes the PROT_NONE problem in a
cleaner way (http://lkml.org/lkml/2004/7/12/227), and it should fix the
ia64 PROT_EXEC problem reported by David Mosberger.  Also,
mprotect(PROT_READ) done by legacy binaries will do the right thing as
well. 

the details:

- the personality bit is added to the personality mask upon exec(),
  within the ELF loader, but is not cleared (see the exceptions below). 
  This means that if an environment that already has the bit exec()s a
  new-style binary it will still get the old behavior.

- one exception are setuid/setgid binaries: these will reset the
  bit - thus local attackers cannot manually set the bit and circumvent
  NX protection. Legacy setuid binaries will still get the bit through
  the ELF loader. This gives us maximum flexibility in shaping
  compatibility environments.

- selinux also clears the bit when switching SIDs via exec().

- x86 is the only arch making use of READ_IMPLIES_EXEC currently. Other
  arches will have the pre-NX-patch protection setup they always had.

I have booted an old distro [RH 7.2] and two new PT_GNU_STACK distros
[SuSE 9.2 and FC2] on an NX-capable CPU - they work just fine and all
the mapping details are right. I've checked the PROT_NONE test-utility
as well and it works as expected. I have checked various setuid
scenarios as well involving legacy and new-style binaries.

an improved setarch utility can be used to set the personality bit
manually:

	http://redhat.com/~mingo/nx-patches/setarch-1.4-3.tar.gz

the new '-X' flag does it, e.g.:

	./setarch -X linux /bin/cat /proc/self/maps

will trigger the old protection layout even on a new distro.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

1bb0fa18

29 Jun, 2004 1 commit

[PATCH] binfmt misc fd passing via ELF aux vector · 191312cc

Yoav Zach authored 20 years ago

The proposed patch uses the aux-vector to pass the fd of the open misc
binary to the interpreter, instead of using argv[1] for that purpose.

Previous patch - open_nonreadable_binaries, offered the option of
binfmt_misc opening the binary on behalf of the interpreter. In case
binfmt_misc is requested to do that it would pass the file-descriptor of
the open binary to the interpreter as its second argument (argv[1]). This
method of passing the file descriptor was suspected to be problematic,
since it changes the command line that users expect to see when using tools
such as 'ps' and 'top'.

The proposed patch changes the method of passing the fd of the open binary
to the translator. Instead of passing it as an argument, binfmt_misc will
request the ELF loader to pass it as a new element in the aux-vector that
it prepares on the stack for the ELF interpreter. With this patch, argv[1]
will hold the full path to the binary regardless of whether binfmt_misc
opened it or not.
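
On the interpreter side, picking the descriptor out of the aux vector amounts to scanning (type, value) pairs until the terminator. The message does not name the entry type; the sketch below assumes it is the ELF AT_EXECFD entry (value 2) and defines all constants and types locally, so every identifier here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define AT_NULL_SKETCH   0UL  /* end-of-vector marker, mirroring ELF's AT_NULL */
#define AT_EXECFD_SKETCH 2UL  /* assumed entry type, mirroring ELF's AT_EXECFD */

/* One aux-vector entry: a type tag followed by its value. */
struct aux_entry_sketch {
    unsigned long type;
    unsigned long val;
};

/* Return the passed file descriptor, or -1 if the vector carries none. */
static long find_exec_fd(const struct aux_entry_sketch *auxv)
{
    assert(auxv != NULL);
    for (; auxv->type != AT_NULL_SKETCH; auxv++)
        if (auxv->type == AT_EXECFD_SKETCH)
            return (long)auxv->val;
    return -1;
}
```

A real interpreter on glibc could probably obtain the same value through getauxval() from <sys/auxv.h> instead of walking the vector by hand.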
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

191312cc

27 Jun, 2004 1 commit

[PATCH] NX (No eXecute) support for x86 · 36bc33ba

Ingo Molnar authored 20 years ago

we'd like to announce the availability of the following kernel patch:

     http://redhat.com/~mingo/nx-patches/nx-2.6.7-rc2-bk2-AE

which makes use of the 'NX' x86 feature pioneered in AMD64 CPUs and for
which support has also been announced by Intel. (other x86 CPU vendors,
Transmeta and VIA announced support as well. Windows support for NX has
also been announced by Microsoft, for their next service pack.) The NX
feature is also being marketed as 'Enhanced Virus Protection'. This
patch makes sure Linux has full support for this hardware feature on x86
too.

What does this patch do? The pagetable format of current x86 CPUs does
not have an 'execute' bit. This means that even if an application maps a
memory area without PROT_EXEC, the CPU will still allow code to be
executed in this memory. This property is often abused by exploits when
they manage to inject hostile code into this memory, for example via a
buffer overflow.

The NX feature changes this and adds a 'don't execute' bit to the PAE
pagetable format. But since the flag defaults to zero (for compatibility
reasons), all pages are executable by default and the kernel has to be
taught to make use of this bit.
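
As an illustration of the "defaults to zero" point, the sketch below treats a PAE page-table entry as a 64-bit value whose top bit is the no-execute flag. The macro and function names are hypothetical and this is not the kernel's pgtable code; only the bit positions (present = bit 0, NX = bit 63) are taken from the x86 PAE format.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT_SKETCH (1ULL << 0)    /* entry maps a page */
#define PTE_NX_SKETCH      (1ULL << 63)   /* no-execute: top bit of a PAE entry */

/* A present page is executable unless the NX bit has been set. */
static int pte_executable(uint64_t pte)
{
    return (pte & PTE_PRESENT_SKETCH) && !(pte & PTE_NX_SKETCH);
}

/* What the kernel must do explicitly for stack, heap and data pages. */
static uint64_t pte_mk_noexec(uint64_t pte)
{
    assert(pte & PTE_PRESENT_SKETCH);
    return pte | PTE_NX_SKETCH;
}
```

An entry built without touching bit 63 stays executable, which is why an unpatched kernel gains nothing from NX-capable hardware.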

If the NX feature is supported by the CPU then the patched kernel turns
on NX and it will enforce userspace executability constraints such as a
no-exec stack and no-exec mmap and data areas. This means less chance
for stack overflows and buffer-overflows to cause exploits.

furthermore, the patch also implements 'NX protection' for kernelspace
code: only the kernel code and modules are executable - so even
kernel-space overflows are harder (in some cases, impossible) to
exploit. Here is how kernel code that tries to execute off the stack is 
stopped:

 kernel tried to access NX-protected page - exploit attempt? (uid: 500)
 Unable to handle kernel paging request at virtual address f78d0f40
  printing eip:
 ...

The patch is based on a prototype NX patch written for 2.4 by Intel -
special thanks go to Suresh Siddha and Jun Nakajima @ Intel. The
existing NX support in the 64-bit x86_64 kernels has been written by
Andi Kleen and this patch is modeled after his code.

Arjan van de Ven has also provided lots of feedback and he has
integrated the patch into the Fedora Core 2 kernel. Test rpms are
available for download at:

    http://redhat.com/~arjanv/2.6/RPMS.kernel/

the kernel-2.6.6-1.411 rpms have the NX patch applied.

here's a quickstart to recompile the vanilla kernel from source with the
NX patch:

    http://redhat.com/~mingo/nx-patches/QuickStart-NX.txt

update:

 - make the heap non-executable on PT_GNU_STACK binaries.

 - make all data mmap()s (and the heap) executable on !PT_GNU_STACK
   (legacy) binaries. This has no effect on non-NX CPUs, but should be
   much more compatible on NX CPUs. The only effect it has on
   non-NX CPUs is the extra 'x' bit displayed in /proc/PID/maps.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

36bc33ba

18 Jun, 2004 2 commits

[PATCH] Clean up asm/pgalloc.h include · 1c60f076

Russell King authored 20 years ago

This patch cleans up needless includes of asm/pgalloc.h from the fs/
kernel/ and mm/ subtrees.  Compile tested on multiple ARM platforms, and
x86, this patch appears safe.

This patch is part of a larger patch aiming towards getting the include of
asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at
things like mm_struct and friends.

I suggest testing in -mm for a while to ensure there aren't any hidden arch
issues.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

1c60f076

[PATCH] Handle non-readable binfmt_misc executables · 79baf43b

Yoav Zach authored 20 years ago

<background>

I work in a group that works on enabling the IA-32 Execution Layer
(http://www.intel.com/pressroom/archive/releases/20040113comp.htm) on Linux.
In a few words - this is a dynamic translator for IA-32 binaries on IPF
platform.  Following David Mosberger's advice - we use the binfmt_misc
mechanism for the invocation of the translator whenever the user tries to
exec an IA-32 binary.

The EL is meant to help in the migration path from IA-32 to IPF.  From our
beta customers we learnt that at first stage - they tend to keep their
environment mostly intact, using the legacy IA-32 binaries.

Such an environment has, naturally, setuid and non-readable binaries.  It
will be useless to ask the administrator to change the settings of such an
environment - some of them are very complex, and the administrators are
reluctant to make any changes in a system that already proved itself to be
robust and secure.  So, our target with these patches is not to enhance the
support for scripts but rather to allow a translator to be integrated into a
working environment that is not (and should not be) aware of the fact it's
being emulated.

As I said before - it is practically hopeless to expect an administrator of
such a system to change it so that it will suit the current behavior of
binfmt_misc.  But, even if we could do that, I'm not sure it would be a good
idea - these changes are likely to be less secure than the suggested patches -

- In order to execute non-readable binaries the binary will have to be made
  readable, which is obviously less secure than allowing only a trusted
  translator to read it

- There will be no way for the translator to calculate the accurate
  AT_SECURE value for the translated process.  This might end up with the
  translated process running in a non-secured mode when it actually needs to
  be secured.

</background>


I prepared a patch that solves a couple of problems that interpreters have
when invoked via binfmt_misc.  currently -

1) such interpreters cannot open non-readable binaries

2) the processes will have their credentials and security attributes
   calculated according to interpreter permissions and not those of the
   original binary

the proposed patch solves these problems by -

1) opening the binary on behalf of the interpreter and passing its fd
   instead of the path as argv[1] to the interpreter

2) calling prepare_binprm with the file struct of the binary and not the
   one of the interpreter

The new functionality is enabled by adding a special flag to the registration
string.  If this flag is not added then old behavior is not changed.

A preliminary version of this patch was sent to the list on 9/1/2003 with the
title "[PATCH]: non-readable binaries - binfmt_misc 2.6.0-test4".  This new
version fixes the concerns that were raised by the patch, except for calling
unshare_files() before allocating a new fd.  This is because this feature did
not enter 2.6 yet.


Arun Sharma <arun.sharma@intel.com> says:

We were going through an internal review of this patch:

http://marc.theaimsgroup.com/?l=linux-kernel&m=107424598901720&w=2

which is in your tree already.  I'm not sure if this line of code got
sufficient review.

+               /* call prepare_binprm before switching to interpreter's file
+                * so that all security calculation will be done according to
+                * binary and not interpreter */
+               retval = prepare_binprm(bprm);

The case that concerns me is: unprivileged interpreter and a privileged
binary.  One can use binfmt_misc to execute untrusted code (interpreter) with
elevated privileges.  One could argue that all binfmt_misc interpreters are
trusted, because only root can register them.  But that's a change from the
traditional behavior of binfmt_misc (and binfmt_script).


(Update):

Arun pointed out that calculating the process credentials according to the
binary that needs to be translated is a bit risky, since it requires the
administrator to pay extra attention not to register an interpreter which is
not intended to run with root credentials.

After discussing this issue with him, I would like to propose a modified
patch: The old patch did 2 things - 1) open the binary for reading and 2)
calculate the credentials according to the binary.

I removed the riskier part of changing the credentials calculation, so the
revised patch only opens the binary for reading.  It also includes a few words
of warning in the description of the 'open-binary' feature in
binfmt_misc.txt, and makes the function entry_status print the flags in use.

As for the 'credentials' part of the patch, I will prepare a separate patch
for it and send it again to the LKML, describe the problem and ask for people's
comments.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

79baf43b

03 Jun, 2004 1 commit

[PATCH] sched: balance-on-exec fix · eb313b41

Andrew Morton authored 20 years ago

From: Jack Steiner <steiner@sgi.com>

It looks like the call to sched_balance_exec() from do_execve() is in the
wrong spot.  The code calls sched_balance_exec() before determining whether
"filename" actually exists.

In many cases, users have several entries in $PATH.  If a full path name is
not specified on the 'exec' call, the library code iterates thru the files
in the PATH list until it finds the program.  This can result in numerous
migrations of the parent process before the program is actually found.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

eb313b41

29 May, 2004 1 commit

[PATCH] sparse: bits and pieces · 4d306504

Alexander Viro authored 20 years ago

Independent minor bits caught by sparse:

 - paride.h mixing void and int in ? :, used always in a void context
   ide-iops.c return insw() - insw is void()
 - scsi/constants.c uses undefined macros in #if; added #define to 0 in
   case that used to leave it undefined
 - usb/host/hcd.h: fixed-point arithmetics in constant
 - fs/exec.c: missing UL on a large constant
 - fs/locks.c: #if where #ifdef should've been
 - fs.h: missing UL on MAX_LFS_FILESIZE in 64bit case

4d306504

22 May, 2004 8 commits

[PATCH] rmap 39 add anon_vma rmap · 8aa3448c

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous
pages.  Instead of tracking anonymous pages by pte_chains or by mm, this
tracks them by vma.  But because vmas are frequently split and merged
(particularly by mprotect), a page cannot point directly to its vma(s), but
instead to an anon_vma list of those vmas likely to contain the page - a list
on which vmas can easily be linked and unlinked as they come and go.  The vmas
on one list are all related, either by forking or by splitting.

This has three particular advantages over anonmm: that it can cope
effortlessly with mremap moves; and no longer needs page_table_lock to protect
an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma
instead of using find_vma; and should use less cpu for swapout since it can
locate its anonymous vmas more quickly.

It does have disadvantages too: a lot more change in mmap.c to deal with
anon_vmas, though small straightforward additions now that the vma merging has
been refactored there; more lowmem needed for each anon_vma and vma structure;
an additional restriction on the merging of vmas (cannot be merged if already
assigned different anon_vmas, since then their pages will be pointing to
different heads).

(There would be no need to enlarge the vma structure if anonymous pages
belonged only to anonymous vmas; but private file mappings accumulate
anonymous pages by copy-on-write, so need to be listed in both anon_vma and
prio_tree at the same time.  A different implementation could avoid that by
using anon_vmas only for purely anonymous vmas, and use the existing prio_tree
to locate cow pages - but that would involve a long search for each single
private copy, probably not a good idea.)

Where before the vm_pgoff of a purely anonymous (not file-backed) vma was
meaningless, now it represents the virtual start address at which that vma is
mapped - which the standard file pgoff manipulations treat linearly as vmas
are split and merged.  But if mremap moves the vma, then it generally carries
its original vm_pgoff to the new location, so pages shared with the old
location can still be found.  Magic.
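
The "magic" is ordinary linear arithmetic. The sketch below assumes that an anonymous page records a linear page index and that a vma maps index vm_pgoff at vm_start; the struct, the helper names and the sample addresses are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT_SKETCH 12   /* assumes 4 KiB pages */

/* Hypothetical cut-down vma: only the two fields the arithmetic needs. */
struct vma_sketch {
    unsigned long vm_start;   /* first virtual address of the mapping */
    unsigned long vm_pgoff;   /* page index that vm_start corresponds to */
};

/* Index recorded for the page mapped at addr inside vma. */
static unsigned long page_index(const struct vma_sketch *vma, unsigned long addr)
{
    assert(vma != NULL && addr >= vma->vm_start);
    return ((addr - vma->vm_start) >> PAGE_SHIFT_SKETCH) + vma->vm_pgoff;
}

/* Virtual address at which a page with this index appears in vma. */
static unsigned long page_address_in(const struct vma_sketch *vma, unsigned long index)
{
    assert(vma != NULL && index >= vma->vm_pgoff);
    return vma->vm_start + ((index - vma->vm_pgoff) << PAGE_SHIFT_SKETCH);
}
```

For a vma created at 0x40000000, vm_pgoff is 0x40000, and the page at 0x40003000 gets index 0x40003. If the vma is later moved to 0x50000000 with vm_pgoff unchanged, the same index resolves to 0x50003000, so the page is still found.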

Hugh has massaged it somewhat: building on the earlier rmap patches, this
patch is a fifth of the size of Andrea's original anon_vma patch.  Please note
that this posting will be his first sight of this patch, which he may or may
not approve.

8aa3448c

[PATCH] rmap 37 page_add_anon_rmap vma · e1fd9cc9

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma
arg like anon_vma rmap, to smooth the transition between them.

e1fd9cc9

[PATCH] rmap 33 install_arg_page vma · 114c71ee

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

anon_vma will need to pass vma to put_dirty_page, so change it and its
various callers (setup_arg_pages and its 32-on-64-bit arch variants); and
please, let's rename it to install_arg_page.

Earlier attempt to do this (rmap 26 __setup_arg_pages) tried to clean up
those callers instead, but failed to boot: so now apply rmap 27's memset
initialization of vmas to these callers too; which relieves them from
needing the recently included linux/mempolicy.h.

While there, moved install_arg_page's flush_dcache_page up before
page_table_lock - doesn't in fact matter at all, just saves one worry when
researching flush_dcache_page locking constraints.

114c71ee

[PATCH] rmap 27 memset 0 vma · c8ba2065

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

We're NULLifying more and more fields when initializing a vma
(mpol_set_vma_default does that too, if configured to do anything).  Now use
memset to avoid specifying fields, and save a little code too.

(Yes, I realize anon_vma will want to set vm_pgoff non-0, but I think that
will be better handled at the core, since anon vm_pgoff is negotiable up until
an anon_vma is actually assigned.)
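
A rough sketch of the idea, assuming a cut-down vma with made-up field types (the real structure has many more fields): zero the whole object first, then set only the fields that are not zero or NULL.

```c
#include <string.h>

/* Hypothetical, cut-down vma; field names follow the kernel's loosely. */
struct vma_fields {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
	unsigned long vm_pgoff;
	void *vm_file;
	void *vm_ops;
	void *vm_private_data;
	void *vm_policy;
};

/*
 * Initialize a vma: memset clears every field, including any added to
 * the struct later, so only the meaningful values need assigning.
 * This relies on all-bits-zero being a NULL pointer, which holds on
 * the platforms Linux runs on.
 */
static void vma_setup(struct vma_fields *vma, unsigned long start,
		      unsigned long end, unsigned long flags)
{
	memset(vma, 0, sizeof(*vma));
	vma->vm_start = start;
	vma->vm_end = end;
	vma->vm_flags = flags;
}
```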

c8ba2065

[PATCH] rmap 16: pretend prio_tree · fc96c90f

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

Pave the way for prio_tree by switching over to its interfaces, but actually
still implement them with the same old lists as before.

Most of the vma_prio_tree interfaces are straightforward. The interesting one
is vma_prio_tree_next, used to search the tree for all vmas which overlap the
given range: unlike the list_for_each_entry it replaces, it does not find
every vma, just those that match.

But this does leave handling of nonlinear vmas in a very unsatisfactory state:
for now we have to search again over the maximum range to find all the
nonlinear vmas which might contain a page, which of course takes away the
point of the tree. Fixed in later patch of this batch.
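
To illustrate the "tree interface over the old list" arrangement, here is a minimal sketch.  The struct, the function names and the argument order are hypothetical and do not reproduce the real vma_prio_tree_next signature; the point is only that the iterator skips vmas that do not overlap the requested page range.

```c
#include <stddef.h>

/* Hypothetical file-backed vma on a plain singly linked list. */
struct file_vma {
	unsigned long vm_pgoff;	/* first file page mapped */
	unsigned long nr_pages;	/* number of pages mapped, at least 1 */
	struct file_vma *next;	/* list linkage standing in for the tree */
};

/* Does the vma map any file page in the inclusive range [begin, end]? */
static int vma_overlaps(const struct file_vma *vma,
			unsigned long begin, unsigned long end)
{
	unsigned long last = vma->vm_pgoff + vma->nr_pages - 1;

	return vma->vm_pgoff <= end && last >= begin;
}

/*
 * Return the next overlapping vma after prev, or the first one when
 * prev is NULL.  Returns NULL when the list is exhausted.
 */
static struct file_vma *overlap_next(struct file_vma *head,
				     struct file_vma *prev,
				     unsigned long begin, unsigned long end)
{
	struct file_vma *vma = prev ? prev->next : head;

	while (vma && !vma_overlaps(vma, begin, end))
		vma = vma->next;
	return vma;
}
```

A caller loops with prev set to the previous result until NULL comes back, so swapping the list walk for a real tree lookup later does not change the calling code.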

There is no need to initialize vma linkage all over, just do it before
inserting the vma in list or tree. /proc/pid/statm had an odd test for its
shared count: simplified to an equivalent test on vm_file.

fc96c90f

[PATCH] small numa api fixups · e52c02f7

Andrew Morton authored 20 years ago

From: Christoph Hellwig <hch@lst.de>

- don't include mempolicy.h in sched.h and mm.h when a forward declaration
  is enough.  Andi argued against that in the past, but I'd really hate to add
  another header to two of the includes used in basically every driver when we
  can include it in the six files actually needing it instead (that number is
  for my ppc32 system, maybe other arches need more include in their
  directories)

- make numa api fields in task_struct conditional on CONFIG_NUMA; this gives
  us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
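
Both points can be shown in a few lines.  This is only a sketch: task_sketch and policy_is_default are invented names, and the real task_struct obviously holds far more.

```c
#include <stddef.h>

/*
 * Forward declaration: enough to declare pointers to the type, so the
 * header that defines struct mempolicy need not be included here.
 */
struct mempolicy;

/* Hypothetical, cut-down task structure. */
struct task_sketch {
	int pid;
#ifdef CONFIG_NUMA
	struct mempolicy *mempolicy;	/* only present on NUMA builds */
#endif
};

/* Passing the pointer around works without the full definition. */
static int policy_is_default(const struct mempolicy *pol)
{
	return pol == NULL;
}

/* Size of the structure for the current configuration. */
static size_t task_sketch_size(void)
{
	return sizeof(struct task_sketch);
}
```

Only code that dereferences the pointer needs the full definition, which is why the include can move into the handful of files that actually do so.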

e52c02f7

[PATCH] numa api: Add VMA hooks for policy · c78b023f

Andrew Morton authored 20 years ago

From: Andi Kleen <ak@suse.de>

NUMA API adds a policy to each VMA.  During VMA creation, merging and
splitting these policies must be handled properly.  This patch adds the calls
to this. 

It is a nop when CONFIG_NUMA is not defined.

c78b023f

[PATCH] rmap 9 remove pte_chains · 123e4df7

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

Lots of deletions: the next patch will put in the new anon rmap, which
should look clearer if first we remove all of the old pte-pointer-based
rmap from the core in this patch - which therefore leaves anonymous rmap
totally disabled, anon pages locked in memory until process frees them.

Leave arch files (and page table rmap) untouched for now, clean them up in
a later batch.  A few constructive changes amidst all the deletions:

Choose names (e.g.  page_add_anon_rmap) and args (e.g.  no more pteps) now
so we need not revisit so many files in the next patch.  Inline function
page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock.
cond_resched_lock in copy_page_range.  Struct page rearranged: no pte
union, just mapcount moved next to atomic count, so two ints can occupy one
long on 64-bit; i386 struct page now 32 bytes even with PAE.  Never pass
PageReserved to page_remove_rmap, only do_wp_page did so.


From: Hugh Dickins <hugh@veritas.com>

  Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock
  (well, might as well just check mapping if !mapcount then): if this page is
  being mapped or unmapped on another cpu at the same time, page_mapping's
  PageAnon(page) and page->mapping are volatile.

  But page_mapping(page) is used more widely: I've a nasty feeling that
  clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added
  (also in 2.6.6 itself).

123e4df7

26 Apr, 2004 1 commit

[PATCH] credentials locking fix · 10c189cd

Andrew Morton authored 20 years ago

From: Chris Wright <chrisw@osdl.org>

Contributions from:
Stephen Smalley <sds@epoch.ncsc.mil>
Andy Lutomirski <luto@stanford.edu>

During exec the LSM bprm_apply_creds() hooks may transition the program to a
new security context (like setuid binaries).  The security context of the new
task is dependent on state such as if the task is being ptraced.  

ptrace_detach() doesn't take the task_lock() when clearing task->ptrace.  So
there is a race possible where a process starts off being ptraced, the
malicious ptracer detaches and if any checks against task->ptrace are done more
than once, the results are indeterminate.

This patch ensures task_lock() is held while bprm_apply_creds() hooks are
called, keeping it safe against ptrace_attach() races.  Additionally, tests
against task->ptrace (and ->fs->count, ->files->count and ->sighand->count all
of which signify potential unsafe resource sharing during a security context
transition) are done only once the results are passed down to hooks, making it
safe against ptrace_detach() races.

Additionally:

- s/must_must_not_trace_exec/unsafe_exec/
- move unsafe_exec() call above security_bprm_apply_creds() call rather than
  in call for readability.
- fix dummy hook to honor the case where root is ptracing
- couple minor formatting/spelling fixes
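
The "test once, pass the result down" pattern might look roughly like the following.  It is a single-threaded model under stated assumptions: the flag names and values, the task_model struct and both function names are hypothetical, the capable(CAP_SETUID) exception is left out, and in the kernel both steps would run with task_lock() held.

```c
/* Hypothetical flag bits; the kernel's names and values may differ. */
#define UNSAFE_SHARE	0x1
#define UNSAFE_PTRACE	0x2

/* Hypothetical model of the task state consulted during exec. */
struct task_model {
	int ptrace;		/* nonzero while being traced */
	int fs_count;		/* users of the fs state */
	int files_count;	/* users of the file table */
	int sighand_count;	/* users of the signal handlers */
	int uid;
	int euid;
};

/* Take one snapshot of everything that makes a credential change unsafe. */
static int unsafe_exec_sketch(const struct task_model *t)
{
	int unsafe = 0;

	if (t->ptrace)
		unsafe |= UNSAFE_PTRACE;
	if (t->fs_count > 1 || t->files_count > 1 || t->sighand_count > 1)
		unsafe |= UNSAFE_SHARE;
	return unsafe;
}

/*
 * Hook: decides from the snapshot only and never re-reads t->ptrace,
 * so a detach after the snapshot cannot change the outcome.
 */
static void apply_creds_sketch(struct task_model *t, int new_euid, int unsafe)
{
	if (unsafe)
		new_euid = t->uid;	/* refuse the elevation */
	t->euid = new_euid;
}
```

If the hook re-read t->ptrace itself, a tracer detaching between the two reads could make the first check refuse the uid change while a later one grants privileges.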

10c189cd

21 Apr, 2004 1 commit

[PATCH] compute_creds race · b7fbe52c

Andrew Morton authored 20 years ago

From: Andy Lutomirski <luto@myrealbox.com>

Fixes from me, Olaf Dietsche <olaf+list.linux-kernel@olafdietsche.de>

In fs/exec.c, compute_creds does:

	task_lock(current);
	if (bprm->e_uid != current->uid || bprm->e_gid != current->gid) {
		current->mm->dumpable = 0;

		if (must_not_trace_exec(current)
		    || atomic_read(&current->fs->count) > 1
		    || atomic_read(&current->files->count) > 1
		    || atomic_read(&current->sighand->count) > 1) {
			if(!capable(CAP_SETUID)) {
				bprm->e_uid = current->uid;
				bprm->e_gid = current->gid;
			}
		}
	}

	current->suid = current->euid = current->fsuid = bprm->e_uid;
	current->sgid = current->egid = current->fsgid = bprm->e_gid;

	task_unlock(current);

	security_bprm_compute_creds(bprm);

I assume the task_lock is to prevent another process (on SMP or preempt)
from ptracing the execing process between the check and the assignment.  If
that's the concern then the fact that the lock is dropped before the call
to security_bprm_compute_creds means that, if security_bprm_compute_creds
does anything interesting, there's a race.

For my (nearly complete) caps patch, I obviously need to fix this.  But I
think it may be exploitable now.  Suppose there are two processes, A (the
malicious code) and B (which uses exec).  B starts out unprivileged (A and
B have, e.g., uid and euid = 500).

1. A ptraces B.

2. B calls exec on some setuid-root program.

3. in cap_bprm_set_security, B sets bprm->cap_permitted to the full
   set.

4. B gets to compute_creds in exec.c, calls task_lock, and does not
   change its uid.

5. B calls task_unlock.

6. A detaches from B (on preempt or SMP).

7. B gets to task_lock in cap_bprm_compute_creds, changes its
   capabilities, and returns from compute_creds into load_elf_binary.

8. load_elf_binary calls create_elf_tables (line 852 in 2.6.5-mm1),
   which calls cap_bprm_secureexec (through LSM), which returns false (!).

9. exec finishes.

The setuid program is now running with uid=euid=500 but full permitted
capabilities.  There are two (or three) ways to effectively get local root
now:

1.  IIRC, linux 2.4 doesn't check capabilities in ptrace, so A could
   just ptrace B again.

2. LD_PRELOAD.

3.  There are probably programs that will misbehave on their own under
   these circumstances.

Is there some reason why this is not doable?

The patch renames bprm_compute_creds to bprm_apply_creds and moves all uid
logic into the hook, where the test and the resulting modification can both
happen under task_lock().

This way, out-of-tree LSMs will fail to compile instead of malfunctioning. 
It should also make life easier for LSMs and will certainly make it easier
for me to finish the cap patch.

b7fbe52c

17 Apr, 2004 1 commit

[PATCH] Fix exec in multithreaded application · bea63af0

Petr Vandrovec authored 20 years ago

The recent controlling terminal changes broke exec from multithreaded
application because de_thread was not upgraded to new arrangement.  I
know that I should not have LD_PRELOAD library which automatically
creates one thread, but it looked like a cool solution to the problem I
had.

de_thread must initialize the controlling terminal information in the
new thread group.

bea63af0

12 Apr, 2004 4 commits

[PATCH] rmap 1 linux/rmap.h · 4c4acd24

Andrew Morton authored 20 years ago

From: Hugh Dickins <hugh@veritas.com>

First of a batch of three rmap patches: this initial batch of three paving
the way for a move to some form of object-based rmap (probably Andrea's, but
drawing from mine too), and making almost no functional change by itself.  A
few days will intervene before the next batch, to give the struct page
changes in the second patch some exposure before proceeding.

rmap 1 create include/linux/rmap.h

Start small: linux/rmap-locking.h has already gathered some declarations
unrelated to locking, and the rest of the rmap declarations were over in
linux/swap.h: gather them all together in linux/rmap.h, and rename the
pte_chain_lock to rmap_lock.

4c4acd24

[PATCH] fix posix-timers to have proper per-process scope · 0e568881

Andrew Morton authored 20 years ago

From: Roland McGrath <roland@redhat.com>

The posix-timers implementation associates timers with the creating thread
and destroys timers when their creator thread dies.  POSIX clearly
specifies that these timers are per-process, and a timer should not be torn
down when the thread that created it exits.  I hope there won't be any
controversy on what the correct semantics are here, since POSIX is clear
and the Linux feature is called "posix-timers".

The attached program built with NPTL -lrt -lpthread demonstrates the bug.
The program is correct by POSIX, but fails on Linux.  Note that until
just the other day, NPTL had a trivial bug that always disabled its use of
kernel timer syscalls (check strace for lack of timer_create/SYS_259).  So
unless you have built your own NPTL libs very recently, you probably won't
see the kernel calls actually used by this program.

Also attached is my patch to fix this.  It (you guessed it) moves the
posix_timers field from task_struct to signal_struct.  Access is now
governed by the siglock instead of the task lock.  exit_itimers is called
from __exit_signal, i.e.  only on the death of the last thread in the
group, rather than from do_exit for every thread.  Timers' it_process
fields store the group leader's pointer, which won't die.  For the case of
SIGEV_THREAD_ID, I hold a ref on the task_struct for it_process to stay
robust in case the target thread dies; the ref is released and the dangling
pointer cleared when the timer fires and the target thread is dead.  (This
should only come up in a buggy user program, so no one cares exactly how the
kernel handles that case.  But I think what I did is robust and sensical.)

/* Test for bogus per-thread deletion of timers.  */

#include <stdio.h>
#include <error.h>
#include <time.h>
#include <signal.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>
#include <pthread.h>

/* Creating timers in another thread should work too.  */
static void *do_timer_create(void *arg)
{
	struct sigevent *const sigev = arg;
	timer_t *const timerId = sigev->sigev_value.sival_ptr;
	if (timer_create(CLOCK_REALTIME, sigev, timerId) < 0) {
		perror("timer_create");
		return NULL;
	}
	return timerId;
}

int main(void)
{
	int i, res;
	timer_t timerId;
	struct itimerspec itval;
	struct sigevent sigev;

	itval.it_interval.tv_sec = 2;
	itval.it_interval.tv_nsec = 0;
	itval.it_value.tv_sec = 2;
	itval.it_value.tv_nsec = 0;

	sigev.sigev_notify = SIGEV_SIGNAL;
	sigev.sigev_signo = SIGALRM;
	sigev.sigev_value.sival_ptr = (void *)&timerId;

	for (i = 0; i < 100; i++) {
		printf("cnt = %d\n", i);

		pthread_t thr;
		res = pthread_create(&thr, NULL, &do_timer_create, &sigev);
		if (res) {
			error(0, res, "pthread_create");
			continue;
		}
		void *val;
		res = pthread_join(thr, &val);
		if (res) {
			error(0, res, "pthread_join");
			continue;
		}
		if (val == NULL)
			continue;

		res = timer_settime(timerId, 0, &itval, NULL);
		if (res < 0)
			perror("timer_settime");

		res = timer_delete(timerId);
		if (res < 0)
			perror("timer_delete");
	}

	return 0;
}

0e568881

[PATCH] Non-Exec stack support · 01cc53b2

Andrew Morton authored 20 years ago

From: Kurt Garloff <garloff@suse.de>

A patch to parse the elf binaries for a PT_GNU_STACK section to set the stack
non-executable if possible.  Most parts have been shamelessly stolen from
Ingo Molnar's more ambitious stackshield
http://people.redhat.com/mingo/exec-shield/exec-shield-2.6.4-C9

The toolchain meanwhile has support for marking binaries with a
PT_GNU_STACK section without the x bit as needed.

If no such section is found, we leave the stack to whatever the arch defaults
to.  If there is one, we explicitly disable the VM_EXEC bit if no x bit is
found, and otherwise explicitly enable it.
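
The three-way decision can be sketched as below.  The constants mirror the values of PT_GNU_STACK and PF_X from <elf.h>, but the cut-down header struct, the enum and the function name are assumptions made for the example; PT_GNU_STACK is treated here as an entry in the program header table, which is what the message calls a section.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PT_GNU_STACK	0x6474e551u	/* value of PT_GNU_STACK in <elf.h> */
#define SK_PF_X		0x1u		/* value of PF_X in <elf.h> */

/* Hypothetical, cut-down program header: only the fields used here. */
struct phdr_sketch {
	uint32_t p_type;
	uint32_t p_flags;
};

enum stack_exec {
	STACK_ARCH_DEFAULT,	/* no marker: leave it to the architecture */
	STACK_NOEXEC,		/* marker without x bit: clear VM_EXEC */
	STACK_EXEC		/* marker with x bit: set VM_EXEC */
};

/* Scan the program headers for the stack marker and report its verdict. */
static enum stack_exec stack_policy(const struct phdr_sketch *ph, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (ph[i].p_type != SK_PT_GNU_STACK)
			continue;
		return (ph[i].p_flags & SK_PF_X) ? STACK_EXEC : STACK_NOEXEC;
	}
	return STACK_ARCH_DEFAULT;
}
```

Keeping "no marker" distinct from "marker without x" is what lets old binaries keep whatever behaviour the architecture already gave them.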

01cc53b2

[PATCH] move job control fields from task_struct to signal_struct · 7860b371

Andrew Morton authored 20 years ago

From: Roland McGrath <roland@redhat.com>

This patch moves all the fields relating to job control from task_struct to
signal_struct, so that all this info is properly per-process rather than
being per-thread.
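
What "per-process rather than per-thread" means in terms of data layout, as a sketch: every thread points at one shared structure, so a change made through one thread is seen by all.  The struct names and the exact field set below are assumptions, not the kernel's definitions.

```c
/* Hypothetical per-process state shared by all threads in a group. */
struct signal_sketch {
	int pgrp;	/* process group id */
	int session;	/* session id */
	int leader;	/* nonzero for a session leader */
	void *tty;	/* controlling terminal, NULL if none */
};

/* Hypothetical per-thread state: job control lives behind ->signal. */
struct thread_sketch {
	int pid;
	struct signal_sketch *signal;
};

/* Read the process group of whichever thread is asked about. */
static int process_group_sketch(const struct thread_sketch *t)
{
	return t->signal->pgrp;
}

/* Change the process group; every thread in the group sees it. */
static void set_process_group_sketch(struct thread_sketch *t, int pgrp)
{
	t->signal->pgrp = pgrp;
}
```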

7860b371

25 Feb, 2004 1 commit

[PATCH] add syscalls.h · 0bab0642

Andrew Morton authored 21 years ago

From: "Randy.Dunlap" <rddunlap@osdl.org>

Add syscalls.h, which contains prototypes for the kernel's system calls.
Replace open-coded declarations all over the place. This patch found a
couple of prior bugs. It appears to be more important with -mregparm=3 as we
discover more asmlinkage mismatches.

Some syscalls have arch-dependent arguments, so their prototypes are in the
arch-specific unistd.h. Maybe it should have been asm/syscalls.h, but there
were already arch-specific syscall prototypes in asm/unistd.h...

Tested on x86, ia64, x86_64, ppc64, s390 and sparc64. May cause
trivial-to-fix build breakage on other architectures.

0bab0642

18 Feb, 2004 1 commit

[PATCH] Enable coredumps > 2GB · 95b387a4

Andrew Morton authored 21 years ago

From: Andi Kleen <ak@muc.de>

Some x86-64 users were complaining that coredumps >2GB don't work.

This will enable large coredump for everybody.  Apparently the 32bit
gdb/binutils cannot handle them, but I hear the binutils people are working
on fixing that.  I doubt it will harm people - unreadable coredumps are not
worse than no coredump and it won't make any difference in space usage if
you get a 1.99GB or a 2.5GB coredump.  So just enable it unconditionally.
If it should be really a problem for 32bit the rlimit defaults in
resource.h could be changed.

For file systems that don't support O_LARGEFILE you should just get a
truncated coredump for big address spaces.

95b387a4

19 Jan, 2004 1 commit

[PATCH] nfs: Fix an open intent bug · 7de3a7b2

Andrew Morton authored 21 years ago

From: Trond Myklebust <trond.myklebust@fys.uio.no>

The following patch fixes a bug when initializing the intent structure
in sys_uselib(): intents use the FMODE_READ convention rather than
O_RDONLY.

It also adds a missing open intent to open_exec(). This ensures that NFS
clients will do the necessary close-to-open data cache consistency
checking.
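
Why the two conventions cannot be mixed: on Linux O_RDONLY is 0, so used as a bitmask it requests nothing, whereas FMODE_READ is a real bit.  A small sketch of the usual conversion follows; the SK_ constants mirror the kernel's FMODE_READ and FMODE_WRITE values, the function name is made up, and the arithmetic assumes the Linux values O_RDONLY=0, O_WRONLY=1, O_RDWR=2.

```c
#include <fcntl.h>

#define SK_FMODE_READ	1	/* mirrors the kernel's FMODE_READ */
#define SK_FMODE_WRITE	2	/* mirrors the kernel's FMODE_WRITE */

/*
 * Convert the access mode in open(2) flags into an FMODE-style bitmask.
 * Adding one maps 0/1/2 (read/write/both) onto the bits 1/2/3.
 */
static int open_flags_to_fmode(int flags)
{
	return ((flags & O_ACCMODE) + 1) & O_ACCMODE;
}
```

With this mapping a read-only open yields SK_FMODE_READ rather than 0, which is the value an intent structure following the FMODE convention would expect.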

7de3a7b2

30 Dec, 2003 1 commit

[PATCH] Fix memleak on execve failure · 7764b6de

Andrew Morton authored 21 years ago

From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp>

I found linux-2.6.0-test11 leaks memory when execve fails.  I've also
checked the bitkeeper tree and the problem seems to be unchanged.

The attached patch is a partial backout of bitkeeper rev.  1.87 of
fs/exec.c.  I guess the original change was a simple mistake.
(free_arg_pages() is a NOP when CONFIG_MMU is defined).

7764b6de

29 Dec, 2003 2 commits

[PATCH] use new steal_locks helper · 02c541ec

Andrew Morton authored 21 years ago

From: Chris Wright <chrisw@osdl.org>

Use the new steal_locks helper to steal the locks from the old files struct
left from unshare_files() when the new unshared struct files gets used.

02c541ec

[PATCH] use new unshare_files helper · 04e9bcb4

Andrew Morton authored 21 years ago

From: Chris Wright <chrisw@osdl.org>

Use unshare_files during binary loading to eliminate potential leak of
the binary's fd installed during execve().  As is, this breaks
binfmt_som.c

04e9bcb4

09 Oct, 2003 1 commit

Revert the process group accessor functions. They are buggy, and · 06349d9d

Linus Torvalds authored 21 years ago

cause NULL pointer references in /proc.

Moreover, it's questionable whether the whole thing makes sense at all. 
Per-thread state is good.

Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200
Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211

06349d9d

05 Oct, 2003 1 commit

[PATCH] move job control fields from task_struct to · 1bd563fd

Andrew Morton authored 21 years ago

From: Roland McGrath <roland@redhat.com>

This patch completes what was started with the `process_group' accessor
function, moving all the job control-related fields from task_struct into
signal_struct and using process_foo accessor functions to read them. All
these things are per-process in POSIX, none per-thread. Off hand it's hard
to come up with the hairy MT scenarios in which the existing code would do
insane things, but trust me, they're there. At any rate, all the uses
being done via inline accessor functions now has got to be all good.

I did a "make allyesconfig" build and caught the few random drivers and
whatnot that referred to these fields. I was surprised to find how few
references to ->tty there really were to fix up. I'm sure there will be a
few more fixups needed in non-x86 code. The only actual testing of a
running kernel with these patches I've done is on my normal minimal x86
config. Everything works fine as it did before as far as I can tell.

One issue that may be of concern is the lack of any locking on multiple
threads diddling these fields. I don't think it really matters, though
there might be some obscure races that could produce inconsistent job
control results. Nothing shattering, I'm sure; probably only something
like a multi-threaded program calling setsid while its other threads do tty
i/o, which never happens in reality. This is the same situation we get by
using ->group_leader->foo without other synchronization, which seemed to be
the trend and no one was worried about it.

1bd563fd