An error occurred fetching the project authors.
- 18 Oct, 2004 3 commits
-
-
Ingo Molnar authored
This patch fixes all the preempt-after-task->state-is-TASK_DEAD problems we had. Right now, the moment procfs does a down() that sleeps in proc_pid_flush() [it could] our TASK_DEAD state is zapped and we might be back to TASK_RUNNING to and we trigger this assert: schedule(); BUG(); /* Avoid "noreturn function does return". */ for (;;) ; I have split out TASK_ZOMBIE and TASK_DEAD into a separate p->exit_state field, to allow the detaching of exit-signal/parent/wait-handling from descheduling a dead task. Dead-task freeing is done via PF_DEAD. Tested the patch on x86 SMP and UP, but all architectures should work fine. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Roland McGrath authored
I've found some problems with exec and fixed them with this patch to de_thread. The second problem is that a multithreaded exec loses all pending signals. This is violation of POSIX rules. But a moment's thought will show it's also just not desireable: if you send a process a SIGTERM while it's in the middle of calling exec, you expect either the original program in that process or the new program being exec'd to handle that signal or be killed by it. As it stands now, you can try to kill a process and have that signal just evaporate if it's multithreaded and calls exec just then. I really don't know what the rationale was behind the de_thread code that allocates a new signal_struct. It doesn't make any sense now. The other code there ensures that the old signal_struct is no longer shared. Except for posix-timers, all the state there is stuff you want to keep. So my changes just keep the old structs when they are no longer shared, and all the right state is retained (after clearing out posix-timers). The final bug is that the cumulative statistics of dead threads and dead child processes are lost in the abandoned signal_struct. This is also fixed by holding on to it instead of replacing it. Signed-off-by:
Roland McGrath <roland@redhat.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Roland McGrath authored
POSIX specifies that the limit settings provided by getrlimit/setrlimit are shared by the whole process, not specific to individual threads. This patch changes the behavior of those calls to comply with POSIX. I've moved the struct rlimit array from task_struct to signal_struct, as it has the correct sharing properties. (This reduces kernel memory usage per thread in multithreaded processes by around 100/200 bytes for 32/64 machines respectively.) I took a fairly minimal approach to the locking issues with the newly shared struct rlimit array. It turns out that all the code that is checking limits really just needs to look at one word at a time (one rlim_cur field, usually). It's only the few places like getrlimit itself (and fork), that require atomicity in accessing a whole struct rlimit, so I just used a spin lock for them and no locking for most of the checks. If it turns out that readers of struct rlimit need more atomicity where they are now cheap, or less overhead where they are now atomic (e.g. fork), then seqcount is certainly the right thing to use for them instead of readers using the spin lock. Though it's in signal_struct, I didn't use siglock since the access to rlimits never needs to disable irqs and doesn't overlap with other siglock uses. Instead of adding something new, I overloaded task_lock(task->group_leader) for this; it is used for other things that are not likely to happen simultaneously with limit tweaking. To me that seems preferable to adding a word, but it would be trivial (and arguably cleaner) to add a separate lock for these users (or e.g. just use seqlock, which adds two words but is optimal for readers). Most of the changes here are just the trivial s/->rlim/->signal->rlim/. I stumbled across what must be a long-standing bug, in reparent_to_init. It does: memcpy(current->rlim, init_task.rlim, sizeof(*(current->rlim))); when surely it was intended to be: memcpy(current->rlim, init_task.rlim, sizeof(current->rlim)); As rlim is an array, the * in the sizeof expression gets the size of the first element, so this just changes the first limit (RLIMIT_CPU). This is for kernel threads, where it's clear that resetting all the rlimits is what you want. With that fixed, the setting of RLIMIT_FSIZE in nfsd is superfluous since it will now already have been reset to RLIM_INFINITY. The other subtlety is removing: tsk->rlim[RLIMIT_CPU].rlim_cur = RLIM_INFINITY; in exit_notify, which was to avoid a race signalling during self-reaping exit. As the limit is now shared, a dying thread should not change it for others. Instead, I avoid that race by checking current->state before the RLIMIT_CPU check. (Adding one new conditional in that path is now required one way or another, since if not for this check there would also be a new race with self-reaping exit later on clearing current->signal that would have to be checked for.) The one loose end left by this patch is with process accounting. do_acct_process temporarily resets the RLIMIT_FSIZE limit while writing the accounting record. I left this as it was, but it is now changing a limit that might be shared by other threads still running. I left this in a dubious state because it seems to me that processing accounting may already be more generally a dubious state when it comes to NPTL threads. I would think you would want one record per process, with aggregate data about all threads that ever lived in it, not a separate record for each thread. I don't use process accounting myself, but if anyone is interested in testing it out I could provide a patch to change it this way. One final note, this is not 100% to POSIX compliance in regards to rlimits. POSIX specifies that RLIMIT_CPU refers to a whole process in aggregate, not to each individual thread. I will provide patches later on to achieve that change, assuming this patch goes in first. Signed-off-by:
Roland McGrath <roland@redhat.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 17 Sep, 2004 1 commit
-
-
Roland McGrath authored
Exec fails to clean up posix-timers. This manifests itself in two ways, one worse than the other. In the single-threaded case, it just fails to clear out the timers on exec. POSIX says that exec clears out the timers from timer_create (though not the setitimer ones), so it's wrong that a lingering timer could fire after exec and kill the process with a signal it's not expecting. In the multi-threaded case, it not only leaves lingering timers, but it leaks them entirely when it replaces signal_struct, so they will never be freed by the process exiting after that exec. The new per-user RLIMIT_SIGPENDING actually limits the damage here, because a UID will fill up its quota with leaked timers and then never be able to use timer_create again (that's what my test program does). But if you have many many untrusted UIDs, this leak could be considered a DoS risk. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 15 Sep, 2004 1 commit
-
-
Denis Vlasenko authored
Allocating the 'struct linux_binprm' on the stack chews up too much stackspace. Just kmalloc it instead. Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 27 Aug, 2004 1 commit
-
-
William Lee Irwin III authored
Merely removing down_read(&mm->mmap_sem) from task_vsize() is too half-assed to let stand. The following patch removes the vma iteration as well as the down_read(&mm->mmap_sem) from both task_mem() and task_statm() and callers for the CONFIG_MMU=y case in favor of accounting the various stats reported at the times of vma creation, destruction, and modification. Unlike the 2.4.x patches of the same name, this has no per-pte-modification overhead whatsoever. This patch quashes end user complaints of top(1) being slow as well as kernel hacker complaints of per-pte accounting overhead simultaneously. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 24 Aug, 2004 3 commits
-
-
Josh Aas authored
A patch that reduces bkl usage in do_coredump. I don't see anywhere that it is necessary except for the call to format_corename, which is controlled via sysctl (sys_sysctl holds the bkl). Also make format_corename() static. Signed-off-by:
Josh Aas <josha@sgi.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Ingo Molnar authored
Rework the i386 mm layout to allow applications to allocate more virtual memory, and larger contiguous chunks. - the patch is compatible with existing architectures that either make use of HAVE_ARCH_UNMAPPED_AREA or use the default mmap() allocator - there is no change in behavior. - 64-bit architectures can use the same mechanism to clean up 32-bit compatibility layouts: by defining HAVE_ARCH_PICK_MMAP_LAYOUT and providing a arch_pick_mmap_layout() function - which can then decide between various mmap() layout functions. - I also introduced a new personality bit (ADDR_COMPAT_LAYOUT) to signal older binaries that dont have PT_GNU_STACK. x86 uses this to revert back to the stock layout. I also changed x86 to not clear the personality bits upon exec(), like x86-64 already does. - once every architecture that uses HAVE_ARCH_UNMAPPED_AREA has defined its arch_pick_mmap_layout() function, we can get rid of HAVE_ARCH_UNMAPPED_AREA altogether, as a final cleanup. the new layout generation function (__get_unmapped_area()) got significant testing in FC1/2, so i'm pretty confident it's robust. Compiles & boots fine on an 'old' and on a 'new' x86 distro as well. The two known breakages were: http://www.redhatconfig.com/msg/67248.html [ 'cyzload' third-party utility broke. ] http://www.zipworld.com/au/~akpm/dde.tar.gz [ your editor broke :-) ] both were caused by application bugs that did: int ret = malloc(); if (ret <= 0) failure; such bugs are easy to spot if they happen, and if it happens it's possible to work it around immediately without having to change the binary, via the setarch patch. No other application has been found to be affected, and this particular change got pretty wide coverage already over RHEL3 and exec-shield, it's in use for more than a year. The setarch utility can be used to trigger the compatibility layout on x86, the following version has been patched to take the `-L' option: http://people.redhat.com/mingo/flexible-mmap/setarch-1.4-2.tar.gz "setarch -L i386 <command>" will run the command with the old layout. From: Hugh Dickins <hugh@veritas.com> The problem is in the flexible mmap patch: arch_get_unmapped_area_topdown is liable to give your mmap vm_start above TASK_SIZE with vm_end wrapped; which is confusing, and ends up as that BUG_ON(mm->map_count). The patch below stops that behaviour, but it's not the full solution: wilson_mmap_test -s 1000 then simply cannot allocate memory for the large mmap, whereas it works fine non-top-down. I think it's wrong to interpret a large or rlim_infinite stack rlimit as an inviolable request to reserve that much for the stack: it makes much less VM available than bottom up, not what was intended. Perhaps top down should go bottom up (instead of belly up) when it fails - but I'd probably better leave that to Ingo. Or perhaps the default should place stack below text (as WLI suggested and ELF intended, with its text defaulting to 0x08048000, small progs sharing page table between stack and text and data); with a further personality for those needing bigger stack. From: Ingo Molnar <mingo@elte.hu> - fall back to the bottom-up layout if the stack can grow unlimited (if the stack ulimit has been set to RLIM_INFINITY) - try the bottom-up allocator if the top-down allocator fails - this can utilize the hole between the true bottom of the stack and its ulimit, as a last-resort effort. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Nick Piggin authored
Move balancing and child-runs-first logic from fork.c into sched.c where it belongs. * Consolidate wake_up_forked_process and wake_up_forked_thread into wake_up_new_process, and pass in clone_flags as suggested by Linus. This removes a lot of code duplication and allows all logic to be handled in that function. * Don't do balance-on-clone balancing for vfork'ed threads. * Don't do set_task_cpu or balance one clone in wake_up_new_process. Instead do it in sched_fork to fix set_cpus_allowed races. * Don't do child-runs-first for CLONE_VM processes, as there is obviously no COW benifit to be had. This is a big one, it enables Andi's workload to run well without clone balancing, because the OpenMP child threads can get balanced off to other nodes *before* they start running and allocating memory. * Rename sched_balance_exec to sched_exec: hide the policy from the API. From: Ingo Molnar <mingo@elte.hu> rename wake_up_new_process -> wake_up_new_task. in sched.c we are gradually moving away from the overloaded 'process' or 'thread' notion to the traditional task (or context) naming. Signed-off-by:
Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 23 Aug, 2004 1 commit
-
-
Mike Kravetz authored
Races have been observed between excec-time overwriting of task->comm and /proc accesses to the same data. This causes environment string information to appear in /proc. Fix that up by taking task_lock() around updates to and accesses to task->comm. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 18 Jul, 2004 1 commit
-
-
Ingo Molnar authored
This cleans up legacy x86 binary support by introducing a new personality bit: READ_IMPLIES_EXEC, and implements Linus' suggestion to add the PROT_EXEC bit on the two affected syscall entry places, sys_mprotect() and sys_mmap(). If this bit is set then PROT_READ will also add the PROT_EXEC bit - as expected by legacy x86 binaries. The ELF loader will automatically set this bit when it encounters a legacy binary. This approach avoids the problems the previous ->def_flags solution caused. In particular this patch fixes the PROT_NONE problem in a cleaner way (http://lkml.org/lkml/2004/7/12/227), and it should fix the ia64 PROT_EXEC problem reported by David Mosberger. Also, mprotect(PROT_READ) done by legacy binaries will do the right thing as well. the details: - the personality bit is added to the personality mask upon exec(), within the ELF loader, but is not cleared (see the exceptions below). This means that if an environment that already has the bit exec()s a new-style binary it will still get the old behavior. - one exception are setuid/setgid binaries: these will reset the bit - thus local attackers cannot manually set the bit and circumvent NX protection. Legacy setuid binaries will still get the bit through the ELF loader. This gives us maximum flexibility in shaping compatibility environments. - selinux also clears the bit when switching SIDs via exec(). - x86 is the only arch making use of READ_IMPLIES_EXEC currently. Other arches will have the pre-NX-patch protection setup they always had. I have booted an old distro [RH 7.2] and two new PT_GNU_STACK distros [SuSE 9.2 and FC2] on an NX-capable CPU - they work just fine and all the mapping details are right. I've checked the PROT_NONE test-utility as well and it works as expected. I have checked various setuid scenarios as well involving legacy and new-style binaries. an improved setarch utility can be used to set the personality bit manually: http://redhat.com/~mingo/nx-patches/setarch-1.4-3.tar.gz the new '-X' flag does it, e.g.: ./setarch -X linux /bin/cat /proc/self/maps will trigger the old protection layout even on a new distro. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 29 Jun, 2004 1 commit
-
-
Yoav Zach authored
The proposed patch uses the aux-vector to pass the fd of the open misc binary to the interpreter, instead of using argv[1] for that purpose. Previous patch - open_nonreadable_binaries, offered the option of binfmt_misc opening the binary on behalf of the interpreter. In case binfmt_misc is requested to do that it would pass the file-descriptor of the open binary to the interpreter as its second argument (argv[1]). This method of passing the file descriptor was suspected to be problematic, since it changes the command line that users expect to see when using tools such as 'ps' and 'top'. The proposed patch changes the method of passing the fd of the open binary to the translator. Instead of passing it as an argument, binfmt_misc will request the ELF loader to pass it as a new element in the aux-vector that it prepares on the stack for ELF interpreter. With this patch, argv[1] will hold the full path to the binary regardless of whether it opened it or not. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 27 Jun, 2004 1 commit
-
-
Ingo Molnar authored
we'd like to announce the availability of the following kernel patch: http://redhat.com/~mingo/nx-patches/nx-2.6.7-rc2-bk2-AE which makes use of the 'NX' x86 feature pioneered in AMD64 CPUs and for which support has also been announced by Intel. (other x86 CPU vendors, Transmeta and VIA announced support as well. Windows support for NX has also been announced by Microsoft, for their next service pack.) The NX feature is also being marketed as 'Enhanced Virus Protection'. This patch makes sure Linux has full support for this hardware feature on x86 too. What does this patch do? The pagetable format of current x86 CPUs does not have an 'execute' bit. This means that even if an application maps a memory area without PROT_EXEC, the CPU will still allow code to be executed in this memory. This property is often abused by exploits when they manage to inject hostile code into this memory, for example via a buffer overflow. The NX feature changes this and adds a 'dont execute' bit to the PAE pagetable format. But since the flag defaults to zero (for compatibility reasons), all pages are executable by default and the kernel has to be taught to make use of this bit. If the NX feature is supported by the CPU then the patched kernel turns on NX and it will enforce userspace executability constraints such as a no-exec stack and no-exec mmap and data areas. This means less chance for stack overflows and buffer-overflows to cause exploits. furthermore, the patch also implements 'NX protection' for kernelspace code: only the kernel code and modules are executable - so even kernel-space overflows are harder (in some cases, impossible) to exploit. Here is how kernel code that tries to execute off the stack is stopped: kernel tried to access NX-protected page - exploit attempt? (uid: 500) Unable to handle kernel paging request at virtual address f78d0f40 printing eip: ... The patch is based on a prototype NX patch written for 2.4 by Intel - special thanks go to Suresh Siddha and Jun Nakajima @ Intel. The existing NX support in the 64-bit x86_64 kernels has been written by Andi Kleen and this patch is modeled after his code. Arjan van de Ven has also provided lots of feedback and he has integrated the patch into the Fedora Core 2 kernel. Test rpms are available for download at: http://redhat.com/~arjanv/2.6/RPMS.kernel/ the kernel-2.6.6-1.411 rpms have the NX patch applied. here's a quickstart to recompile the vanilla kernel from source with the NX patch: http://redhat.com/~mingo/nx-patches/QuickStart-NX.txt update: - make the heap non-executable on PT_GNU_STACK binaries. - make all data mmap()s (and the heap) executable on !PT_GNU_STACK (legacy) binaries. This has no effect on non-NX CPUs, but should be much more compatible on NX CPUs. The only effect it has it has on non-NX CPUs is the extra 'x' bit displayed in /proc/PID/maps. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 18 Jun, 2004 2 commits
-
-
Russell King authored
This patch cleans up needless includes of asm/pgalloc.h from the fs/ kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and x86, this patch appears safe. This patch is part of a larger patch aiming towards getting the include of asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at things like mm_struct and friends. I suggest testing in -mm for a while to ensure there aren't any hidden arch issues. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Yoav Zach authored
<background> I work in a group that works on enabling the IA-32 Execution Layer (http://www.intel.com/pressroom/archive/releases/20040113comp.htm) on Linux. In a few words - this is a dynamic translator for IA-32 binaries on IPF platform. Following David Mosberger's advice - we use the binfmt_misc mechanism for the invocation of the translator whenever the user tries to exec an IA-32 binary. The EL is meant to help in the migration path from IA-32 to IPF. From our beta customers we learnt that at first stage - they tend to keep their environment mostly intact, using the legacy IA-32 binaries. Such an environment has, naturally, setuid and non-readable binaries. It will be useless to ask the administrator to change the settings of such an environment - some of them are very complex, and the administrators are reluctant to make any changes in a system that already proved itself to be robust and secure. So, our target with these patches is not to enhance the support for scripts but rather to allow a translator to be integrated into a working environment that is not (and should not be) aware to the fact it's being emulated. As I said before - it is practically hopeless to expect an administrator of such a system to change it so that it will suit the current behavior of binfmt_misc. But, even if we could do that, I'm not sure it would be a good idea - these changes are likely to be less secure than the suggested patches - - In order to execute non-readable binaries the binary will have to be made readable, which is obviously less secure than allowing only a trusted translator to read it - There will be no way for the translator to calculate the accurate AT_SECURE value for the translated process. This might end up with the translated process running in a non-secured mode when it actually needs to be secured. </background> I prepared a patch that solves a couple of problems that interpreters have when invoked via binfmt_misc. currently - 1) such interpreters cannot open non-readable binaries 2) the processes will have their credentials and security attributes calculated according to interpreter permissions and not those of the original binary the proposed patch solves these problems by - 1) opening the binary on behalf of the interpreter and passing its fd instead of the path as argv[1] to the interpreter 2) calling prepare_binprm with the file struct of the binary and not the one of the interpreter The new functionality is enabled by adding a special flag to the registration string. If this flag is not added then old behavior is not changed. A preliminary version of this patch was sent to the list on 9/1/2003 with the title "[PATCH]: non-readable binaries - binfmt_misc 2.6.0-test4". This new version fixes the concerns that were raised by the patch, except of calling unshare_files() before allocating a new fd. this is because this feature did not enter 2.6 yet. Arun Sharma <arun.sharma@intel.com> says: We were going through an internal review of this patch: http://marc.theaimsgroup.com/?l=linux-kernel&m=107424598901720&w=2 which is in your tree already. I'm not sure if this line of code got sufficient review. + /* call prepare_binprm before switching to interpreter's file + * so that all security calculation will be done according to + * binary and not interpreter */ + retval = prepare_binprm(bprm); The case that concerns me is: unprivileged interpreter and a privileged binary. One can use binfmt_misc to execute untrusted code (interpreter) with elevated privileges. One could argue that all binfmt_misc interpreters are trusted, because only root can register them. But that's a change from the traditional behavior of binfmt_misc (and binfmt_script). (Update): Arun pointed out that calculating the process credentials according to the binary that needs to be translated is a bit risky, since it requires the administrator to pay extra attention not to register an interpreter which is not intended to run with root credentials. After discussing this issue with him, I would like to propose a modified patch: The old patch did 2 things - 1) open the binary for reading and 2) calculate the credentials according to the binary. I removed the riskier part of changing the credentials calculation, so the revised patch only opens the binary for reading. It also includes few words of warning in the description of the 'open-binary' feature in binfmt_misc.txt, and makes the function entry_status print the flags in use. As for the 'credentials' part of the patch, I will prepare a separate patch for it and send it again to the LKML, describe the problem and ask for people comments. Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 03 Jun, 2004 1 commit
-
-
Andrew Morton authored
From: Jack Steiner <steiner@sgi.com> It looks like the call to sched_balance_exec() from do_execve() is in the wrong spot. The code calls sched_balance_exec() before determining whether "filename" actually exists. In many cases, users have several entries in $PATH. If a full path name is not specified on the 'exec" call, the library code iterates thru the files in the PATH list until it finds the program. This can result is numerous migrations of the parent process before the program is actually found. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 29 May, 2004 1 commit
-
-
Alexander Viro authored
Independent minor bits caught by sparse: - paride.h mixing void and int in ? :, used always in a void context ide-iops.c return insw() - insw is void() - scsi/constants.c uses undefined macros in #if; added #define to 0 in case that used to leave it undefined - usb/host/hcd.h: fixed-point arithmetics in constant - fs/exec.c: missing UL on a large constant - fs/locks.c: #if where #ifdef should've been - fs.h: missing UL on MAX_LFS_FILESIZE in 64bit case
-
- 22 May, 2004 8 commits
-
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Andrea Arcangeli's anon_vma object-based reverse mapping scheme for anonymous pages. Instead of tracking anonymous pages by pte_chains or by mm, this tracks them by vma. But because vmas are frequently split and merged (particularly by mprotect), a page cannot point directly to its vma(s), but instead to an anon_vma list of those vmas likely to contain the page - a list on which vmas can easily be linked and unlinked as they come and go. The vmas on one list are all related, either by forking or by splitting. This has three particular advantages over anonmm: that it can cope effortlessly with mremap moves; and no longer needs page_table_lock to protect an mm's vma tree, since try_to_unmap finds vmas via page -> anon_vma -> vma instead of using find_vma; and should use less cpu for swapout since it can locate its anonymous vmas more quickly. It does have disadvantages too: a lot more change in mmap.c to deal with anon_vmas, though small straightforward additions now that the vma merging has been refactored there; more lowmem needed for each anon_vma and vma structure; an additional restriction on the merging of vmas (cannot be merged if already assigned different anon_vmas, since then their pages will be pointing to different heads). (There would be no need to enlarge the vma structure if anonymous pages belonged only to anonymous vmas; but private file mappings accumulate anonymous pages by copy-on-write, so need to be listed in both anon_vma and prio_tree at the same time. A different implementation could avoid that by using anon_vmas only for purely anonymous vmas, and use the existing prio_tree to locate cow pages - but that would involve a long search for each single private copy, probably not a good idea.) Where before the vm_pgoff of a purely anonymous (not file-backed) vma was meaningless, now it represents the virtual start address at which that vma is mapped - which the standard file pgoff manipulations treat linearly as vmas are split and merged. But if mremap moves the vma, then it generally carries its original vm_pgoff to the new location, so pages shared with the old location can still be found. Magic. Hugh has massaged it somewhat: building on the earlier rmap patches, this patch is a fifth of the size of Andrea's original anon_vma patch. Please note that this posting will be his first sight of this patch, which he may or may not approve.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Silly final patch for anonmm rmap: change page_add_anon_rmap's mm arg to vma arg like anon_vma rmap, to smooth the transition between them.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> anon_vma will need to pass vma to put_dirty_page, so change it and its various callers (setup_arg_pages and its 32-on-64-bit arch variants); and please, let's rename it to install_arg_page. Earlier attempt to do this (rmap 26 __setup_arg_pages) tried to clean up those callers instead, but failed to boot: so now apply rmap 27's memset initialization of vmas to these callers too; which relieves them from needing the recently included linux/mempolicy.h. While there, moved install_arg_page's flush_dcache_page up before page_table_lock - doesn't in fact matter at all, just saves one worry when researching flush_dcache_page locking constraints.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> We're NULLifying more and more fields when initializing a vma (mpol_set_vma_default does that too, if configured to do anything). Now use memset to avoid specifying fields, and save a little code too. (Yes, I realize anon_vma will want to set vm_pgoff non-0, but I think that will be better handled at the core, since anon vm_pgoff is negotiable up until an anon_vma is actually assigned.)
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Pave the way for prio_tree by switching over to its interfaces, but actually still implement them with the same old lists as before. Most of the vma_prio_tree interfaces are straightforward. The interesting one is vma_prio_tree_next, used to search the tree for all vmas which overlap the given range: unlike the list_for_each_entry it replaces, it does not find every vma, just those that match. But this does leave handling of nonlinear vmas in a very unsatisfactory state: for now we have to search again over the maximum range to find all the nonlinear vmas which might contain a page, which of course takes away the point of the tree. Fixed in later patch of this batch. There is no need to initialize vma linkage all over, just do it before inserting the vma in list or tree. /proc/pid/statm had an odd test for its shared count: simplified to an equivalent test on vm_file.
-
Andrew Morton authored
From: Christoph Hellwig <hch@lst.de> - don't include mempolicy.h in sched.h and mm.h when a forward delcaration is enough. Andi argued against that in the past, but I'd really hate to add another header to two of the includes used in basically every driver when we can include it in the six files actually needing it instead (that number is for my ppc32 system, maybe other arches need more include in their directories) - make numa api fields in tast_struct conditional on CONFIG_NUMA, this gives us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.
-
Andrew Morton authored
From: Andi Kleen <ak@suse.de> NUMA API adds a policy to each VMA. During VMA creattion, merging and splitting these policies must be handled properly. This patch adds the calls to this. It is a nop when CONFIG_NUMA is not defined.
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> Lots of deletions: the next patch will put in the new anon rmap, which should look clearer if first we remove all of the old pte-pointer-based rmap from the core in this patch - which therefore leaves anonymous rmap totally disabled, anon pages locked in memory until process frees them. Leave arch files (and page table rmap) untouched for now, clean them up in a later batch. A few constructive changes amidst all the deletions: Choose names (e.g. page_add_anon_rmap) and args (e.g. no more pteps) now so we need not revisit so many files in the next patch. Inline function page_dup_rmap for fork's copy_page_range, simply bumps mapcount under lock. cond_resched_lock in copy_page_range. Struct page rearranged: no pte union, just mapcount moved next to atomic count, so two ints can occupy one long on 64-bit; i386 struct page now 32 bytes even with PAE. Never pass PageReserved to page_remove_rmap, only do_wp_page did so. From: Hugh Dickins <hugh@veritas.com> Move page_add_anon_rmap's BUG_ON(page_mapping(page)) inside the rmap_lock (well, might as well just check mapping if !mapcount then): if this page is being mapped or unmapped on another cpu at the same time, page_mapping's PageAnon(page) and page->mapping are volatile. But page_mapping(page) is used more widely: I've a nasty feeling that clear_page_anon, page_add_anon_rmap and/or page_mapping need barriers added (also in 2.6.6 itself),
-
- 26 Apr, 2004 1 commit
-
-
Andrew Morton authored
From: Chris Wright <chrisw@osdl.org> Contributions from: Stephen Smalley <sds@epoch.ncsc.mil> Andy Lutomirski <luto@stanford.edu> During exec the LSM bprm_apply_creds() hooks may tranisition the program to a new security context (like setuid binaries). The security context of the new task is dependent on state such as if the task is being ptraced. ptrace_detach() doesn't take the task_lock() when clearing task->ptrace. So there is a race possible where a process starts off being ptraced, the malicious ptracer detaches and if any checks agains task->ptrace are done more than once, the results are indeterminate. This patch ensures task_lock() is held while bprm_apply_creds() hooks are called, keeping it safe against ptrace_attach() races. Additionally, tests against task->ptrace (and ->fs->count, ->files->count and ->sighand->count all of which signify potential unsafe resource sharing during a security context transition) are done only once the results are passed down to hooks, making it safe against ptrace_detach() races. Additionally: - s/must_must_not_trace_exec/unsafe_exec/ - move unsafe_exec() call above security_bprm_apply_creds() call rather than in call for readability. - fix dummy hook to honor the case where root is ptracing - couple minor formatting/spelling fixes
-
- 21 Apr, 2004 1 commit
-
-
Andrew Morton authored
From: Andy Lutomirski <luto@myrealbox.com> Fixes from me, Olaf Dietsche <olaf+list.linux-kernel@olafdietsche.de> In fs/exec.c, compute_creds does: task_lock(current); if (bprm->e_uid != current->uid || bprm->e_gid != current->gid) { current->mm->dumpable = 0; if (must_not_trace_exec(current) || atomic_read(¤t->fs->count) > 1 || atomic_read(¤t->files->count) > 1 || atomic_read(¤t->sighand->count) > 1) { if(!capable(CAP_SETUID)) { bprm->e_uid = current->uid; bprm->e_gid = current->gid; } } } current->suid = current->euid = current->fsuid = bprm->e_uid; current->sgid = current->egid = current->fsgid = bprm->e_gid; task_unlock(current); security_bprm_compute_creds(bprm); I assume the task_lock is to prevent another process (on SMP or preempt) from ptracing the execing process between the check and the assignment. If that's the concern then the fact that the lock is dropped before the call to security_brpm_compute_creds means that, if security_bprm_compute_creds does anything interesting, there's a race. For my (nearly complete) caps patch, I obviously need to fix this. But I think it may be exploitable now. Suppose there are two processes, A (the malicious code) and B (which uses exec). B starts out unprivileged (A and B have, e.g., uid and euid = 500). 1. A ptraces B. 2. B calls exec on some setuid-root program. 3. in cap_bprm_set_security, B sets bprm->cap_permitted to the full set. 4. B gets to compute_creds in exec.c, calls task_lock, and does not change its uid. 5. B calls task_unlock. 6. A detaches from B (on preempt or SMP). 7. B gets to task_lock in cap_bprm_compute_creds, changes its capabilities, and returns from compute_creds into load_elf_binary. 8. load_elf_binary calls create_elf_tables (line 852 in 2.6.5-mm1), which calls cap_bprm_secureexec (through LSM), which returns false (!). 9. exec finishes. The setuid program is now running with uid=euid=500 but full permitted capabilities. There are two (or three) ways to effectively get local root now: 1. IIRC, linux 2.4 doesn't check capabilities in ptrace, so A could just ptrace B again. 2. LD_PRELOAD. 3. There are probably programs that will misbehave on their own under these circumstances. Is there some reason why this is not doable? The patch renames bprm_compute_creds to bprm_apply_creds and moves all uid logic into the hook, where the test and the resulting modification can both happen under task_lock(). This way, out-of-tree LSMs will fail to compile instead of malfunctioning. It should also make life easier for LSMs and will certainly make it easier for me to finish the cap patch.
-
- 17 Apr, 2004 1 commit
-
-
Petr Vandrovec authored
The recent controlling terminal changes broke exec from multithreaded application because de_thread was not upgraded to new arrangement. I know that I should not have LD_PRELOAD library which automatically creates one thread, but it looked like a cool solution to the problem I had. de_thread must initialize the controlling terminal information in the new thread group.
-
- 12 Apr, 2004 4 commits
-
-
Andrew Morton authored
From: Hugh Dickins <hugh@veritas.com> First of a batch of three rmap patches: this initial batch of three paving the way for a move to some form of object-based rmap (probably Andrea's, but drawing from mine too), and making almost no functional change by itself. A few days will intervene before the next batch, to give the struct page changes in the second patch some exposure before proceeding. rmap 1 create include/linux/rmap.h Start small: linux/rmap-locking.h has already gathered some declarations unrelated to locking, and the rest of the rmap declarations were over in linux/swap.h: gather them all together in linux/rmap.h, and rename the pte_chain_lock to rmap_lock.
-
Andrew Morton authored
From: Roland McGrath <roland@redhat.com> The posix-timers implementation associates timers with the creating thread and destroys timers when their creator thread dies. POSIX clearly specifies that these timers are per-process, and a timer should not be torn down when the thread that created it exits. I hope there won't be any controversy on what the correct semantics are here, since POSIX is clear and the Linux feature is called "posix-timers". The attached program built with NPTL -lrt -lpthread demonstrates the bug. The program is correct by POSIX, but fails on Linux. Note that a until just the other day, NPTL had a trivial bug that always disabled its use of kernel timer syscalls (check strace for lack of timer_create/SYS_259). So unless you have built your own NPTL libs very recently, you probably won't see the kernel calls actually used by this program. Also attached is my patch to fix this. It (you guessed it) moves the posix_timers field from task_struct to signal_struct. Access is now governed by the siglock instead of the task lock. exit_itimers is called from __exit_signal, i.e. only on the death of the last thread in the group, rather than from do_exit for every thread. Timers' it_process fields store the group leader's pointer, which won't die. For the case of SIGEV_THREAD_ID, I hold a ref on the task_struct for it_process to stay robust in case the target thread dies; the ref is released and the dangling pointer cleared when the timer fires and the target thread is dead. (This should only come up in a buggy user program, so noone cares exactly how the kernel handles that case. But I think what I did is robust and sensical.) /* Test for bogus per-thread deletion of timers. */ #include <stdio.h> #include <error.h> #include <time.h> #include <signal.h> #include <stdint.h> #include <sys/time.h> #include <sys/resource.h> #include <unistd.h> #include <pthread.h> /* Creating timers in another thread should work too. */ static void *do_timer_create(void *arg) { struct sigevent *const sigev = arg; timer_t *const timerId = sigev->sigev_value.sival_ptr; if (timer_create(CLOCK_REALTIME, sigev, timerId) < 0) { perror("timer_create"); return NULL; } return timerId; } int main(void) { int i, res; timer_t timerId; struct itimerspec itval; struct sigevent sigev; itval.it_interval.tv_sec = 2; itval.it_interval.tv_nsec = 0; itval.it_value.tv_sec = 2; itval.it_value.tv_nsec = 0; sigev.sigev_notify = SIGEV_SIGNAL; sigev.sigev_signo = SIGALRM; sigev.sigev_value.sival_ptr = (void *)&timerId; for (i = 0; i < 100; i++) { printf("cnt = %d\n", i); pthread_t thr; res = pthread_create(&thr, NULL, &do_timer_create, &sigev); if (res) { error(0, res, "pthread_create"); continue; } void *val; res = pthread_join(thr, &val); if (res) { error(0, res, "pthread_join"); continue; } if (val == NULL) continue; res = timer_settime(timerId, 0, &itval, NULL); if (res < 0) perror("timer_settime"); res = timer_delete(timerId); if (res < 0) perror("timer_delete"); } return 0; }
-
Andrew Morton authored
From: Kurt Garloff <garloff@suse.de> A patch to parse the elf binaries for a PT_GNU_STACK section to set the stack non-executable if possible. Most parts have been shamelessly stolen from Ingo Molnar's more ambitious stackshield http://people.redhat.com/mingo/exec-shield/exec-shield-2.6.4-C9 The toolchain has meanwhile support for marking the binaries with a PT_GNU_STACK section wwithout x bit as needed. If no such section is found, we leave the stack to whatever the arch defaults to. If there is one, we explicitly disabled the VM_EXEC bit if no x bit is found, otherwise explicitly enable.
-
Andrew Morton authored
From: Roland McGrath <roland@redhat.com> This patch moves all the fields relating to job control from task_struct to signal_struct, so that all this info is properly per-process rather than being per-thread.
-
- 25 Feb, 2004 1 commit
-
-
Andrew Morton authored
From: "Randy.Dunlap" <rddunlap@osdl.org> Add syscalls.h, which contains prototypes for the kernel's system calls. Replace open-coded declarations all over the place. This patch found a couple of prior bugs. It appears to be more important with -mregparm=3 as we discover more asmlinkage mismatches. Some syscalls have arch-dependent arguments, so their prototypes are in the arch-specific unistd.h. Maybe it should have been asm/syscalls.h, but there were already arch-specific syscall prototypes in asm/unistd.h... Tested on x86, ia64, x86_64, ppc64, s390 and sparc64. May cause trivial-to-fix build breakage on other architectures.
-
- 18 Feb, 2004 1 commit
-
-
Andrew Morton authored
From: Andi Kleen <ak@muc.de> Some x86-64 users were complaining that coredumps >2GB don't work. This will enable large coredump for everybody. Apparently the 32bit gdb/binutils cannot handle them, but I hear the binutils people are working on fixing that. I doubt it will harm people - unreadable coredumps are not worse than no coredump and it won't make any difference in space usage if you get a 1.99GB or a 2.5GB coredump. So just enable it unconditionally. If it should be really a problem for 32bit the rlimit defaults in resource.h could be changed. For file systems that don't support O_LARGEFILE you should just get an truncated coredumps for big address spaces.
-
- 19 Jan, 2004 1 commit
-
-
Andrew Morton authored
From: Trond Myklebust <trond.myklebust@fys.uio.no> The following patch fixes a bug when initializing the intent structure in sys_uselib(): intents use the FMODE_READ convention rather than O_RDONLY. It also adds a missing open intent to open_exec(). This ensures that NFS clients will do the necessary close-to-open data cache consistency checking.
-
- 30 Dec, 2003 1 commit
-
-
Andrew Morton authored
From: IWAMOTO Toshihiro <iwamoto@valinux.co.jp> I found linux-2.6.0-test11 leaks memory when execve fails. I've also checked the bitkeeper tree and the problem seems to be unchanged. The attached patch is a partial backout of bitkeeper rev. 1.87 of fs/exec.c. I guess the original change was a simple mistake. (free_arg_pages() is a NOP when CONFIG_MMU is defined).
-
- 29 Dec, 2003 2 commits
-
-
Andrew Morton authored
From: Chris Wright <chrisw@osdl.org> Use the new steal_locks helper to steal the locks from the old files struct left from unshare_files() when the new unshared struct files gets used.
-
Andrew Morton authored
From: Chris Wright <chrisw@osdl.org> Use unshare_files during binary loading to eliminate potential leak of the binary's fd installed during execve(). As is, this breaks binfmt_som.c
-
- 09 Oct, 2003 1 commit
-
-
Linus Torvalds authored
cause NULL pointer references in /proc. Moreover, it's questionable whether the whole thing makes sense at all. Per-thread state is good. Cset exclude: davem@nuts.ninka.net|ChangeSet|20031005193942|01097 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180420|42200 Cset exclude: akpm@osdl.org[torvalds]|ChangeSet|20031005180411|42211
-
- 05 Oct, 2003 1 commit
-
-
Andrew Morton authored
From: Roland McGrath <roland@redhat.com> This patch completes what was started with the `process_group' accessor function, moving all the job control-related fields from task_struct into signal_struct and using process_foo accessor functions to read them. All these things are per-process in POSIX, none per-thread. Off hand it's hard to come up with the hairy MT scenarios in which the existing code would do insane things, but trust me, they're there. At any rate, all the uses being done via inline accessor functions now has got to be all good. I did a "make allyesconfig" build and caught the few random drivers and whatnot that referred to these fields. I was surprised to find how few references to ->tty there really were to fix up. I'm sure there will be a few more fixups needed in non-x86 code. The only actual testing of a running kernel with these patches I've done is on my normal minimal x86 config. Everything works fine as it did before as far as I can tell. One issue that may be of concern is the lack of any locking on multiple threads diddling these fields. I don't think it really matters, though there might be some obscure races that could produce inconsistent job control results. Nothing shattering, I'm sure; probably only something like a multi-threaded program calling setsid while its other threads do tty i/o, which never happens in reality. This is the same situation we get by using ->group_leader->foo without other synchronization, which seemed to be the trend and noone was worried about it.
-