sched: Lower chances of cputime scaling overflow
Some users have reported that after running a process with hundreds of threads on intensive CPU-bound loads, the cputime of the group started to freeze after a few days. This is due to how we scale the tick-based cputime against the scheduler precise execution time value. We add the values of all threads in the group and we multiply that against the sum of the scheduler exec runtime of the whole group. This easily overflows after a few days/weeks of execution. A proposed solution to solve this was to compute that multiplication on stime instead of utime: 62188451 ("cputime: Avoid multiplication overflow on utime scaling") The rationale behind that was that it's easy for a thread to spend most of its time in userspace under intensive CPU-bound workload but it's much harder to do CPU-bound intensive long run in the kernel. This postulate got defeated when a user recently reported he was still seeing cputime freezes after the above patch. The workload that triggers this issue relates to intensive networking workloads where most of the cputime is consumed in the kernel. To reduce much more the opportunities for multiplication overflow, lets reduce the multiplication factors to the remainders of the division between sched exec runtime and cputime. Assuming the difference between these shouldn't ever be that large, it could work on many situations. This gets the same results as in the upstream scaling code except for a small difference: the upstream code always rounds the results to the nearest integer not greater to what would be the precise result. The new code rounds to the nearest integer either greater or not greater. In practice this difference probably shouldn't matter but it's worth mentioning. If this solution appears not to be enough in the end, we'll need to partly revert back to the behaviour prior to commit 0cf55e1e ("sched, cputime: Introduce thread_group_times()") Back then, the scaling was done on exit() time before adding the cputime of an exiting thread to the signal struct. And then we'll need to scale one-by-one the live threads cputime in thread_group_cputime(). The drawback may be a slightly slower code on exit time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
Showing
Please register or sign in to comment