1. 19 Nov, 2020 8 commits
  2. 10 Nov, 2020 24 commits
  3. 29 Oct, 2020 8 commits
    • Julia Lawall's avatar
      sched/fair: Check for idle core in wake_affine · d8fcb81f
      Julia Lawall authored
      In the case of a thread wakeup, wake_affine determines whether a core
      will be chosen for the thread on the socket where the thread ran
      previously or on the socket of the waker.  This is done primarily by
      comparing the load of the core where th thread ran previously (prev)
      and the load of the waker (this).
      
      commit 11f10e54 ("sched/fair: Use load instead of runnable load
      in wakeup path") changed the load computation from the runnable load
      to the load average, where the latter includes the load of threads
      that have already blocked on the core.
      
      When a short-running daemon processes happens to run on prev, this
      change raised the situation that prev could appear to have a greater
      load than this, even when prev is actually idle.  When prev and this
      are on the same socket, the idle prev is detected later, in
      select_idle_sibling.  But if that does not hold, prev is completely
      ignored, causing the waking thread to move to the socket of the waker.
      In the case of N mostly active threads on N cores, this triggers other
      migrations and hurts performance.
      
      In contrast, before commit 11f10e54, the load on an idle core
      was 0, and in the case of a non-idle waker core, the effect of
      wake_affine was to select prev as the target for searching for a core
      for the waking thread.
      
      To avoid unnecessary migrations, extend wake_affine_idle to check
      whether the core where the thread previously ran is currently idle,
      and if so simply return that core as the target.
      
      [1] commit 11f10e54 ("sched/fair: Use load instead of runnable
      load in wakeup path")
      
      This particularly has an impact when using the ondemand power manager,
      where kworkers run every 0.004 seconds on all cores, increasing the
      likelihood that an idle core will be considered to have a load.
      
      The following numbers were obtained with the benchmarking tool
      hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
      benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
      tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
      2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
      management were used.  Times are in seconds.  All experiments use all
      160 hardware threads.
      
      	v5.9/intel-pstate	v5.9+patch/intel-pstate
      bt.C.c	24.725724+-0.962340	23.349608+-1.607214
      lu.C.x	29.105952+-4.804203	25.249052+-5.561617
      sp.C.x	31.220696+-1.831335	30.227760+-2.429792
      ua.C.x	26.606118+-1.767384	25.778367+-1.263850
      
      	v5.9/ondemand		v5.9+patch/ondemand
      bt.C.c	25.330360+-1.028316	23.544036+-1.020189
      lu.C.x	35.872659+-4.872090	23.719295+-3.883848
      sp.C.x	32.141310+-2.289541	29.125363+-0.872300
      ua.C.x	29.024597+-1.667049	25.728888+-1.539772
      
      On the smaller data sets (A and B) and on the other NAS benchmarks
      there is no impact on performance.
      
      This also has a major impact on the splash2x.volrend benchmark of the
      parsec benchmark suite that goes from 1m25 without this patch to 0m45,
      in active (intel_pstate) mode.
      
      Fixes: 11f10e54 ("sched/fair: Use load instead of runnable load in wakeup path")
      Signed-off-by: default avatarJulia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
      d8fcb81f
    • Peter Zijlstra's avatar
      sched: Remove relyance on STRUCT_ALIGNMENT · 43c31ac0
      Peter Zijlstra authored
      Florian reported that all of kernel/sched/ is rebuild when
      CONFIG_BLK_DEV_INITRD is changed, which, while not a bug is
      unexpected. This is due to us including vmlinux.lds.h.
      
      Jakub explained that the problem is that we put the alignment
      requirement on the type instead of on a variable. Type alignment is a
      minimum, the compiler is free to pick any larger alignment for a
      specific instance of the type (eg. the variable).
      
      So force the type alignment on all individual variable definitions and
      remove the undesired dependency on vmlinux.lds.h.
      
      Fixes: 85c2ce91 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
      Reported-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Suggested-by: default avatarJakub Jelinek <jakub@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      43c31ac0
    • Thomas Gleixner's avatar
      sched: Reenable interrupts in do_sched_yield() · 345a957f
      Thomas Gleixner authored
      do_sched_yield() invokes schedule() with interrupts disabled which is
      not allowed. This goes back to the pre git era to commit a6efb709
      ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.
      
      Reenable interrupts and remove the misleading comment which "explains" it.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
      345a957f
    • Mathieu Desnoyers's avatar
      sched: membarrier: document memory ordering scenarios · 25595eb6
      Mathieu Desnoyers authored
      Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
      Stern for refreshing my memory. Now that I have those in mind, it seems
      appropriate to serialize them to comments for posterity.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
      25595eb6
    • Mathieu Desnoyers's avatar
      sched: membarrier: cover kthread_use_mm (v4) · 618758ed
      Mathieu Desnoyers authored
      Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
      to allow the effect of membarrier(2) to apply to kthreads accessing
      user-space memory as well.
      
      Given that no prior kthread use this guarantee and that it only affects
      kthreads, adding this guarantee does not affect user-space ABI.
      
      Refine the check in membarrier_global_expedited to exclude runqueues
      running the idle thread rather than all kthreads from the IPI cpumask.
      
      Now that membarrier_global_expedited can IPI kthreads, the scheduler
      also needs to update the runqueue's membarrier_state when entering lazy
      TLB state.
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
      618758ed
    • Mathieu Desnoyers's avatar
      sched: fix exit_mm vs membarrier (v4) · 5bc78502
      Mathieu Desnoyers authored
      exit_mm should issue memory barriers after user-space memory accesses,
      before clearing current->mm, to order user-space memory accesses
      performed prior to exit_mm before clearing tsk->mm, which has the
      effect of skipping the membarrier private expedited IPIs.
      
      exit_mm should also update the runqueue's membarrier_state so
      membarrier global expedited IPIs are not sent when they are not
      needed.
      
      The membarrier system call can be issued concurrently with do_exit
      if we have thread groups created with CLONE_VM but not CLONE_THREAD.
      
      Here is the scenario I have in mind:
      
      Two thread groups are created, A and B. Thread group B is created by
      issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
      Let's assume we have a single thread within each thread group (Thread A
      and Thread B).
      
      The AFAIU we can have:
      
      Userspace variables:
      
      int x = 0, y = 0;
      
      CPU 0                   CPU 1
      Thread A                Thread B
      (in thread group A)     (in thread group B)
      
      x = 1
      barrier()
      y = 1
      exit()
      exit_mm()
      current->mm = NULL;
                              r1 = load y
                              membarrier()
                                skips CPU 0 (no IPI) because its current mm is NULL
                              r2 = load x
                              BUG_ON(r1 == 1 && r2 == 0)
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
      5bc78502
    • Peter Zijlstra's avatar
      sched/fair: Exclude the current CPU from find_new_ilb() · 45da7a2b
      Peter Zijlstra authored
      It is possible for find_new_ilb() to select the current CPU, however,
      this only happens from newidle balancing, in which case need_resched()
      will be true, and consequently nohz_csd_func() will not trigger the
      softirq.
      
      Exclude the current CPU from becoming an ILB target.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      45da7a2b
    • Peter Zijlstra's avatar
      sched/cpupri: Add CPUPRI_HIGHER · b13772f8
      Peter Zijlstra authored
      Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use
      by higher priority tasks (specifically deadline).
      
      XXX: we should probably drive PUSH-PULL from cpupri, that would
      automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDietmar Eggemann <dietmar.eggemann@arm.com>
      b13772f8