    [PATCH] signal-fixes-2.5.59-A4 · ebf5ebe3
    Ingo Molnar authored
    this is the current threading patchset, which accumulated during the
    past two weeks. It consists of a big set of changes from Roland, to
    make threaded signals work. There were still tons of testcases and
    boundary conditions (mostly in the signal/exit/ptrace area) that we did
    not handle correctly.
    
    Roland's thread-signal semantics/behavior/ptrace fixes:
    
     - fix signal delivery race with do_exit() => signals are re-queued to the
       'process' if do_exit() finds pending unhandled ones. This prevents
       signals getting lost upon thread-sys_exit().
    
     - a non-main thread has died on one processor and gone to TASK_ZOMBIE,
       but before it's gotten to release_task a sys_wait4 on the other
       processor reaps it.  It's only because it's ptraced that this gets
       through eligible_child.  Somewhere in there the main thread is also
       dying so it reparents the child thread to hit that case.  This means
       that there is a race where P might be totally invalid.
    
     - forget_original_parent is not doing the right thing when the group
       leader dies, i.e. reparenting threads to init when there is a zombie
       group leader.  Perhaps it doesn't matter for any practical purpose
       without ptrace, though it makes for ppid=1 for each thread in core
       dumps, which looks funny. Incidentally, SIGCHLD here really should be
       p->exit_signal.
    
     - one of the gdb tests makes a questionable assumption about what kill
       will do when it has some threads stopped by ptrace and others running.
    
    exit races:
    
    1. Processor A is in sys_wait4 case TASK_STOPPED considering task P.
       Processor B is about to resume P and then switch to it.
    
       While A is inside that case block, B starts running P and it clears
       P->exit_code, or takes a pending fatal signal and sets it to a new
       value. Depending on the interleaving, the possible failure modes are:
            a. A gets to its put_user after B has cleared P->exit_code
               => returns with WIFSTOPPED, WSTOPSIG==0
            b. A gets to its put_user after B has set P->exit_code anew
               => returns with e.g. WIFSTOPPED, WSTOPSIG==SIGKILL
    
       A can spend an arbitrarily long time in that case block, because
       there's getrusage and put_user that can take page faults, and
       write_lock'ing of the tasklist_lock that can block.  But even if it's
       short the race is there in principle.
    
    2. This is new with NPTL, i.e. CLONE_THREAD.
       Two processors A and B are both in sys_wait4 case TASK_STOPPED
       considering task P.
    
       Both get through their tests and fetches of P->exit_code before either
       gets to P->exit_code = 0.  => two threads return the same pid from
       waitpid.
    
       In other interleavings where one processor gets to its put_user after
       the other has cleared P->exit_code, it's like case 1(a).
    
    
    3. SMP races with stop/cont signals
    
       First, take:
    
            kill(pid, SIGSTOP);
            kill(pid, SIGCONT);
    
       or:
    
            kill(pid, SIGSTOP);
            kill(pid, SIGKILL);
    
       It's possible for this to leave the process stopped with a pending
       SIGCONT/SIGKILL.  That's a state that should never be possible.
       Moreover, kill(pid, SIGKILL) without any repetition should always be
       enough to kill a process.  (Likewise SIGCONT when you know it's
       sequenced after the last stop signal, must be sufficient to resume a
       process.)
    
    4. take:
    
            kill(pid, SIGKILL);     // or any fatal signal
            kill(pid, SIGCONT);     // or SIGKILL
    
       It's possible for this to cause pid to be reaped with status 0
       instead of its true termination status.  The equivalent scenario
       happens when the process being killed is in an _exit call or a
       trap-induced fatal signal before the kills.
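
    The expected user-visible behavior for the kill sequences in cases 3
    and 4 can be sketched as a small userspace check. This is only an
    illustration and not part of the patch: the helper names
    spawn_idle_child() and signal_pair_status() are made up for this
    sketch, and it only covers the pairs that end in termination
    (SIGSTOP+SIGKILL and SIGKILL+SIGCONT). A single run passing does not
    prove the races are gone, it only shows what a correct kernel should
    report.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that sleeps until a signal arrives. */
static pid_t spawn_idle_child(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();            /* child: default signal dispositions */
    }
    return pid;                 /* parent: child pid, or -1 on failure */
}

/*
 * Hypothetical helper: send 'first' and then 'second' to a fresh child
 * and reap it.  Only meant for pairs that terminate the child, since
 * waitpid() is called without WUNTRACED and would otherwise block.
 * Returns 0 and stores the wait status in *status, or -1 on error.
 */
int signal_pair_status(int first, int second, int *status)
{
    pid_t pid = spawn_idle_child();

    if (pid < 0)
        return -1;
    if (kill(pid, first) < 0)
        return -1;
    /* The child may already be a zombie here; the result is ignored. */
    (void)kill(pid, second);
    if (waitpid(pid, status, 0) != pid)
        return -1;
    return 0;
}
```

    On a kernel without the races, both signal_pair_status(SIGSTOP,
    SIGKILL, &st) and signal_pair_status(SIGKILL, SIGCONT, &st) should
    yield WIFSIGNALED(st) with WTERMSIG(st) == SIGKILL, never a stopped
    child or a status of 0. Running the check in a tight loop on an SMP
    box is more likely to hit a bad interleaving than a single call.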
    
    plus i've done stability fixes for bugs that popped up during
    beta-testing, and minor tidying of Roland's changes:
    
     - a rare tasklist corruption during exec, causing some very spurious and
       colorful crashes.
    
     - a copy_process()-related dereference of an already freed thread
       structure if hit with a SIGKILL at the wrong moment.
    
     - SMP spinlock deadlocks in the signal code
    
    this patchset has been tested quite well in the 2.4 backport of the
    threading changes - and i've also done some stresstesting on 2.5.59
    SMP, plus an x86 UP testcompile + testboot.