1. 04 Feb, 2019 10 commits
    • sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · c546951d
      Andrea Parri authored
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
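      A minimal sketch of the kind of annotation this implies (illustrative
      only, not the exact diff; the accesses shown are the ones in the
      diagram above):

      	/* move_queued_task(): */
      	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
      	__set_task_cpu(p, new_cpu);	/* contains the WMB before storing ->cpu */

      	/* task_cpu(), as used by task_rq() on the reader side: */
      	static inline unsigned int task_cpu(const struct task_struct *p)
      	{
      		return READ_ONCE(p->cpu);
      	}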
      Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c546951d
    • sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK · 1ca4fa3a
      Hidetoshi Seto authored
      register_sched_domain_sysctl() copies the cpu_possible_mask into
      sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
      allocated (i.e., CONFIG_CPUMASK_OFFSTACK is set).  However, when
      CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
      uninitialized (all zeroes) and the kernel may fail to initialize
      sched_domain sysctl entries for all possible CPUs.
      
      This is visible to the user if the kernel is booted with maxcpus=n, or
      if ACPI tables have been modified to leave CPUs offline, and the user
      then checks for missing /proc/sys/kernel/sched_domain/cpu* entries.
      
      Fix this by separating the allocation from the initialization, and by
      adding a flag so that the possible-CPU entries are initialized only
      during system boot.
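      A rough sketch of the resulting shape of register_sched_domain_sysctl()
      (illustrative; names and details may differ from the actual patch):

      	static bool init_done = false;

      	if (!cpumask_available(sd_sysctl_cpus)) {
      		if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
      			return;
      	}

      	if (!init_done) {
      		init_done = true;
      		/* init to possible to not have holes in @cpu_entries */
      		cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
      	}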
      Tested-by: Syuuichirou Ishii <ishii.shuuichir@jp.fujitsu.com>
      Tested-by: Tarumizu, Kohei <tarumizu.kohei@jp.fujitsu.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1ca4fa3a
    • sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity · 10a35e68
      Vincent Guittot authored
      util_est is mainly meant to be a lower bound for a task's utilization.
      That's why task_util_est() returns the actual util_avg when it's higher
      than the estimated utilization.
      
      With the new invariance signal and without any special check on sample
      collection, if a task is limited because of thermal capping for
      example, we could end up overestimating its utilization and thus
      perhaps generating an unwanted frequency spike when the capping is
      relaxed... and (even worse) it will take some more activations for the
      estimated utilization to converge back to the actual utilization.
      
      Since we cannot easily know if there is idle time on a CPU when a task
      completes an activation with a utilization higher than the CPU's
      capacity, we skip the sampling when utilization is higher than the
      CPU's capacity.
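      A simplified sketch of the check this adds to the util_est update path
      (helper names here are assumptions, not necessarily the exact patch):

      	/*
      	 * Skip the util_est update if we cannot grant that there was
      	 * idle time on this CPU while the task was running.
      	 */
      	if (task_util(p) > capacity_orig_of(cpu_of(rq_of(cfs_rq))))
      		return;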
      Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10a35e68
    • sched/fair: Update scale invariance of PELT · 23127296
      Vincent Guittot authored
      The current implementation of load tracking invariance scales the
      contribution with the current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by the current capacity of the CPU. Another is that
      load_avg is not invariant because it is not scaled with uarch.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range:
      
          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
      
      where U is the max util_avg value = SCHED_CAPACITY_SCALE.
      
      At a lower capacity, the range becomes:
      
          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
      
      where C reflects the compute capacity ratio between the current capacity
      and the max capacity.
      
      So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
      
      Instead of scaling the contribution value of the PELT algorithm, we should
      scale the running time. The PELT signal aims to track the amount of
      computation of tasks and/or rqs, so it seems more correct to scale the
      running time to reflect the effective amount of computation done since the
      last update.
      
      In order to be fully invariant, we need to apply the same amount of
      running time and idle time whatever the current capacity. Because running
      at lower capacity implies that the task will run longer, we have to ensure
      that the same amount of idle time will be applied when the system becomes
      idle and no idle time has been "stolen". But reaching the maximum
      utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
      always-running task whatever the capacity of the CPU (even at max compute
      capacity). In this case, we can discard this "stolen" idle time, which
      becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per rq.
      This clock advances at a rate scaled by the current capacity when something
      is running on the rq, and it is synchronized with clock_task when the rq is
      idle. With this mechanism, we ensure the same running and idle time whatever
      the current capacity. This also makes it possible to simplify the PELT
      algorithm by removing all references to uarch and frequency and by applying
      the same contribution to utilization and load. Furthermore, the scaling is
      done only once per clock update (update_rq_clock_task()) instead of during
      each update of the sched_entities and cfs/rt/dl_rq of the rq, as in the
      current implementation. This is interesting when cgroups are involved, as
      shown in the results below.
      
      The tests ran on a hikey (octo Arm64 platform), using the performance
      cpufreq governor and only the shallowest c-state, to remove the variance
      generated by those power features so that we only track the impact of the
      PELT algorithm.
      
      Each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      The responsiveness of PELT is also improved by this new algorithm when the
      CPU is not running at max capacity. Below are some examples of the time
      needed to reach typical load values, according to the capacity of the CPU,
      with the current implementation and with this patch. These values have
      been computed based on the geometric series and the half-period value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
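      For reference, these durations are consistent with the PELT geometric
      series: with a 1ms period and a 32ms half-life (y^32 = 1/2), the time for
      a previously-idle, always-running task to reach utilization u at max
      capacity is approximately:
      
          t(u) = -32 * log2(1 - u/SCHED_CAPACITY_SCALE) ms
      
      e.g. t(972) ~= 138ms and t(486) ~= 30ms, matching the first column; at
      half capacity the new algorithm accrues the signal half as fast, which is
      consistent with the doubled values in the last column.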
      
      On my hikey (octo Arm64 platform) with the schedutil governor, the time to
      reach the max OPP when starting from a null utilization decreases from
      223ms with the current scale invariance down to 121ms with the new
      algorithm.
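      A condensed sketch of the clock_pelt update described above (simplified;
      the real helper and scaling macros may differ in detail):

      	static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
      	{
      		if (unlikely(is_idle_task(rq->curr))) {
      			/* rq is idle: sync clock_pelt back to clock_task */
      			rq->clock_pelt = rq_clock_task(rq);
      			return;
      		}

      		/* Scale the elapsed time by frequency and uarch capacity */
      		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
      		delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));

      		rq->clock_pelt += delta;
      	}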
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23127296
    • sched/fair: Move the rq_of() helper function · 62478d99
      Vincent Guittot authored
      Move the rq_of() helper function so it can be used in pelt.c.
      
      [ mingo: Improve readability while at it. ]
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62478d99
    • sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable task_struct.stack_refcount is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
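      The conversion itself is mechanical; a representative sketch of the kind
      of change involved (illustrative, not the full diff):

      	-	atomic_t		stack_refcount;
      	+	refcount_t		stack_refcount;

      	-	atomic_set(&tsk->stack_refcount, 1);
      	+	refcount_set(&tsk->stack_refcount, 1);

      	-	if (atomic_dec_and_test(&tsk->stack_refcount))
      	+	if (refcount_dec_and_test(&tsk->stack_refcount))
      			release_task_stack(tsk);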
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For task_struct.stack_refcount it might make a difference
      in the following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b89d39
    • sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable task_struct.usage is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For task_struct.usage it might make a difference
      in the following places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ec1d2819
    • sched/fair: Convert numa_group.refcount to refcount_t · c45a7795
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable numa_group.refcount is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For numa_group.refcount it might make a difference
      in the following places:
      
       - get_numa_group(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_numa_group(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-4-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c45a7795
    • sched/core: Convert signal_struct.sigcnt to refcount_t · 60d4de3f
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable signal_struct.sigcnt is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For signal_struct.sigcnt it might make a difference
      in the following places:
      
       - put_signal_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60d4de3f
    • sched/core: Convert sighand_struct.count to refcount_t · d036bda7
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable sighand_struct.count is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and will hopefully soon be
      in a state to be merged into the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For sighand_struct.count it might make a difference
      in the following places:
      
       - __cleanup_sighand: decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d036bda7
  2. 29 Jan, 2019 1 commit
  3. 27 Jan, 2019 9 commits
    • sched/fair: Fix unnecessary increase of balance interval · 46a745d9
      Vincent Guittot authored
      In case of active balancing, we increase the balance interval to cover
      pinned-task cases not covered by the all_pinned logic. Nevertheless, the
      active migration triggered by asym packing should be treated as the normal
      unbalanced case and reset the interval to its default value; otherwise
      active migration for asym_packing can easily be delayed for hundreds of ms
      because of this pinned-task detection mechanism.
      
      The same happens to other conditions tested in need_active_balance() like
      misfit task and when the capacity of src_cpu is reduced compared to
      dst_cpu (see comments in need_active_balance() for details).
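      A sketch of the intended logic at the end of load_balance(), assuming a
      helper (here called voluntary_active_balance()) that covers the asym
      packing, misfit and reduced-capacity cases (illustrative):

      	if (likely(!active_balance) || voluntary_active_balance(&env)) {
      		/* We were unbalanced, so reset the balance interval */
      		sd->balance_interval = sd->min_interval;
      	} else if (sd->balance_interval < sd->max_interval) {
      		/* Only the pinned-task case keeps growing the interval */
      		sd->balance_interval *= 2;
      	}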
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: valentin.schneider@arm.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      46a745d9
    • sched/fair: Fix rounding bug for asym packing · 4ad4e481
      Vincent Guittot authored
      When check_asym_packing() is triggered, the imbalance is set to:
      
        busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
      
      But busiest_stat.avg_load equals:
      
        sgs->group_load * SCHED_CAPACITY_SCALE / sgs->group_capacity
      
      These divisions can introduce rounding that makes the imbalance
      slightly lower than the weighted load of the cfs_rq.  But this is
      enough to skip the rq in find_busiest_queue() and prevent asym
      migration from happening.
      
      Directly set imbalance to busiest's sgs->group_load to remove the
      rounding.
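      As a hypothetical illustration of the rounding: with group_capacity =
      1000 and group_load = 999, avg_load = 999 * 1024 / 1000 = 1022
      (truncated), so the computed imbalance = 1022 * 1000 / 1024 = 998, just
      below the load of 999, which is enough for find_busiest_queue() to skip
      the rq.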
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: valentin.schneider@arm.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4ad4e481
    • sched/fair: Trigger asym_packing during idle load balance · a062d164
      Vincent Guittot authored
      Newly idle load balancing is not always triggered when a CPU becomes idle.
      This prevents the scheduler from getting a chance to migrate the task
      for asym packing.
      
      Enable active migration during idle load balance too.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: valentin.schneider@arm.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a062d164
    • sched/doc: Document Energy Aware Scheduling · 81a930d3
      Quentin Perret authored
      Add some documentation detailing the main design points of EAS, as well
      as a list of its dependencies.
      
      Parts of this documentation are taken from Morten Rasmussen's original
      EAS posting: https://lkml.org/lkml/2015/7/7/754
      Co-authored-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Quentin Perret <quentin.perret@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Qais Yousef <qais.yousef@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: corbet@lwn.net
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Link: https://lkml.kernel.org/r/20190110110546.8101-3-quentin.perret@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      81a930d3
    • PM/EM: Document the Energy Model framework · 1017b48c
      Quentin Perret authored
      Introduce a documentation file summarizing the key design points and
      APIs of the newly introduced Energy Model framework.
      Signed-off-by: Quentin Perret <quentin.perret@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: corbet@lwn.net
      Cc: dietmar.eggemann@arm.com
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: qais.yousef@arm.com
      Cc: rjw@rjwysocki.net
      Link: https://lkml.kernel.org/r/20190110110546.8101-2-quentin.perret@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1017b48c
    • sched/fair: Robustify CFS-bandwidth timer locking · c0ad4aa4
      Peter Zijlstra authored
      Traditionally hrtimer callbacks were run with IRQs disabled, but with
      the introduction of HRTIMER_MODE_SOFT it is possible they run from
      SoftIRQ context, which does _NOT_ have IRQs disabled.
      
      Allow the CFS bandwidth timers (period_timer and slack_timer) to
      be run from SoftIRQ context; this entails removing the assumption that
      IRQs are already disabled from the locking.
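      A minimal sketch of what that means for the period timer callback
      (simplified; the slack timer and do_sched_cfs_period_timer() would need
      the same treatment):

      	static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
      	{
      		struct cfs_bandwidth *cfs_b =
      			container_of(timer, struct cfs_bandwidth, period_timer);
      		unsigned long flags;
      		int idle = 0;

      		raw_spin_lock_irqsave(&cfs_b->lock, flags);
      		/* ... period handling ... */
      		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

      		return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
      	}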
      
      While mainline doesn't strictly need this, -RT forces all timers not
      explicitly marked with MODE_HARD into MODE_SOFT and trips over this.
      And marking these timers as MODE_HARD doesn't make sense as they're
      not required for RT operation and can potentially be quite expensive.
      Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c0ad4aa4
    • sched/core: Give DCE a fighting chance · f8a696f2
      Peter Zijlstra authored
      All that fancy new Energy-Aware scheduling foo is hidden behind a
      static_key, which is awesome if you have the stuff enabled in your
      config.
      
      However, when you lack all the prerequisites it doesn't make any sense
      to pretend we'll ever actually run this, so provide a little more clue
      to the compiler so it can more aggressively delete the code.
      
         text    data     bss     dec     hex filename
        50297     976      96   51369    c8a9 defconfig-build/kernel/sched/fair.o
        49227     944      96   50267    c45b defconfig-build/kernel/sched/fair.o
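      A sketch of the kind of hint involved, assuming the sched_energy_enabled()
      helper used by the EAS paths (illustrative):

      	#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
      	static inline bool sched_energy_enabled(void)
      	{
      		return static_branch_unlikely(&sched_energy_present);
      	}
      	#else
      	static inline bool sched_energy_enabled(void) { return false; }
      	#endif

      With a compile-time constant 'false' when the prerequisites are missing,
      the compiler can drop the EAS code paths entirely, which is where the
      size reduction above comes from.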
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f8a696f2
    • sched/topology: Introduce a sysctl for Energy Aware Scheduling · 8d5d0cfb
      Quentin Perret authored
      In its current state, Energy Aware Scheduling (EAS) starts automatically
      on asymmetric platforms having an Energy Model (EM). However, there are
      users who want to have an EM (for thermal management for example), but
      don't want EAS with it.
      
      In order to let users disable EAS explicitly, introduce a new sysctl
      called 'sched_energy_aware'. It is enabled by default so that EAS can
      start automatically on platforms where it makes sense. Flipping it to 0
      rebuilds the scheduling domains and disables EAS.
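      Assuming the usual procfs mapping for entries in the kernel sysctl table,
      disabling EAS at run time would then look like:

      	echo 0 > /proc/sys/kernel/sched_energy_aware

      (or 'sysctl kernel.sched_energy_aware=0'); writing 1 re-enables it.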
      Signed-off-by: Quentin Perret <quentin.perret@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: adharmap@codeaurora.org
      Cc: chris.redpath@arm.com
      Cc: currojerez@riseup.net
      Cc: dietmar.eggemann@arm.com
      Cc: edubezval@gmail.com
      Cc: gregkh@linuxfoundation.org
      Cc: javi.merino@kernel.org
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pkondeti@codeaurora.org
      Cc: rjw@rjwysocki.net
      Cc: skannan@codeaurora.org
      Cc: smuckle@google.com
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Cc: tkjos@google.com
      Cc: valentin.schneider@arm.com
      Cc: vincent.guittot@linaro.org
      Cc: viresh.kumar@linaro.org
      Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d5d0cfb
    • MAINTAINERS, sched: Drop PREEMPTIBLE KERNEL section entry · 62a8ddc9
      Lukas Bulwahn authored
      The PREEMPTIBLE KERNEL section entry seems quite outdated:
      
      Robert Love is not actively maintaining the file anymore, nor has he been
      a recorded contributor to the files in the PREEMPTIBLE KERNEL section for
      the last few years.
      
      The mailing list kpreempt-tech@lists.sourceforge.net does not exist
      anymore; the website just points to some very old patches for v2.4/v2.5.
      
      So, let's delete the PREEMPTIBLE KERNEL section entry and clean this up:
      
      - Documentation/preempt-locking.txt is not modified much anyway, and the
        changes in that file are generally maintained by Jonathan Corbet. So,
        we do not need to explicitly mention Documentation/preempt-locking.txt
        in MAINTAINERS.
      
      - include/linux/preempt.h is maintained by Peter and Ingo, so we simply
        add that file to the SCHEDULER section entry.
      
      I was directed to this issue when I could not subscribe to the outdated
      mailing list address, decided to investigate, and then cleaned this up.
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Robert Love <rml@tech9.net>
      Cc: Robert Love <robert.w.love@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190112060613.7115-1-lukas.bulwahn@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62a8ddc9
  4. 21 Jan, 2019 3 commits
  5. 20 Jan, 2019 13 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7d0ae236
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix endless loop in nf_tables, from Phil Sutter.
      
       2) Fix cross namespace ip6_gre tunnel hash list corruption, from
          Olivier Matz.
      
       3) Don't be too strict in phy_start_aneg() otherwise we might not allow
          restarting auto negotiation. From Heiner Kallweit.
      
       4) Fix various KMSAN uninitialized value cases in tipc, from Ying Xue.
      
       5) Memory leak in act_tunnel_key, from Davide Caratti.
      
       6) Handle chip errata of mv88e6390 PHY, from Andrew Lunn.
      
       7) Remove linear SKB assumption in fou/fou6, from Eric Dumazet.
      
       8) Missing udplite rehash callbacks, from Alexey Kodanev.
      
       9) Log dirty pages properly in vhost, from Jason Wang.
      
      10) Use consume_skb() in neigh_probe() as this is a normal free not a
          drop, from Yang Wei. Likewise in macvlan_process_broadcast().
      
      11) Missing device_del() in mdiobus_register() error paths, from Thomas
          Petazzoni.
      
      12) Fix checksum handling of short packets in mlx5, from Cong Wang.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (96 commits)
        bpf: in __bpf_redirect_no_mac pull mac only if present
        virtio_net: bulk free tx skbs
        net: phy: phy driver features are mandatory
        isdn: avm: Fix string plus integer warning from Clang
        net/mlx5e: Fix cb_ident duplicate in indirect block register
        net/mlx5e: Fix wrong (zero) TX drop counter indication for representor
        net/mlx5e: Fix wrong error code return on FEC query failure
        net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
        tools: bpftool: Cleanup license mess
        bpf: fix inner map masking to prevent oob under speculation
        bpf: pull in pkt_sched.h header for tooling to fix bpftool build
        selftests: forwarding: Add a test case for externally learned FDB entries
        selftests: mlxsw: Test FDB offload indication
        mlxsw: spectrum_switchdev: Do not treat static FDB entries as sticky
        net: bridge: Mark FDB entries that were added by user as such
        mlxsw: spectrum_fid: Update dummy FID index
        mlxsw: pci: Return error on PCI reset timeout
        mlxsw: pci: Increase PCI SW reset timeout
        mlxsw: pci: Ring CQ's doorbell before RDQ's
        MAINTAINERS: update email addresses of liquidio driver maintainers
        ...
      7d0ae236
    • pstore/ram: Avoid allocation and leak of platform data · 5631e857
      Kees Cook authored
      Yue Hu noticed that when parsing the device tree the allocated platform
      data was never freed. Since it's not used beyond the function's scope,
      this switches to using a stack variable instead.
      Reported-by: Yue Hu <huyue2@yulong.com>
      Fixes: 35da6094 ("pstore/ram: add Device Tree bindings")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      5631e857
    • gcc-plugins: arm_ssp_per_task_plugin: fix for GCC 9+ · 2c88c742
      Ard Biesheuvel authored
      GCC 9 reworks the way the references to the stack canary are
      emitted, to prevent the value from being spilled to the stack
      before the final comparison in the epilogue, defeating the
      purpose, given that the spill slot is under control of the
      attacker that we are protecting ourselves from.
      
      Since our canary value address is obtained without accessing
      memory (as opposed to pre-v7 code that will obtain it from a
      literal pool), it is unlikely (although not guaranteed) that
      the compiler will spill the canary value in the same way, so
      let's just disable this improvement when building with GCC9+.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      2c88c742
    • gcc-plugins: arm_ssp_per_task_plugin: sign extend the SP mask · 560706d5
      Ard Biesheuvel authored
      The ARM per-task stack protector GCC plugin hits an assert in
      the compiler in some cases, due to the fact that the SP mask
      expression is not sign-extended as it should be. So fix that.
      Suggested-by: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      560706d5
    • Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost · bb617b9b
      Linus Torvalds authored
      Pull virtio/vhost fixes and cleanups from Michael Tsirkin:
       "Fixes and cleanups all over the place"
      
      * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
        vhost/scsi: Use copy_to_iter() to send control queue response
        vhost: return EINVAL if iovecs size does not match the message size
        virtio-balloon: tweak config_changed implementation
        virtio: don't allocate vqs when names[i] = NULL
        virtio_pci: use queue idx instead of array idx to set up the vq
        virtio: document virtio_config_ops restrictions
        virtio: fix virtio_config_ops description
      bb617b9b
    • Merge tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 1be969f4
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A handful of fixes (some of them in testing for a long time):
      
         - fix some test failures regarding cleanup after transaction abort
      
         - revert of a patch that could cause a deadlock
      
         - delayed iput fixes, which can help in ENOSPC situations when there's
           low space and a lot of data to write"
      
      * tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: wakeup cleaner thread when adding delayed iput
        btrfs: run delayed iputs before committing
        btrfs: wait on ordered extents on abort cleanup
        btrfs: handle delayed ref head accounting cleanup in abort
        Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
      1be969f4
    • Merge tags 'compiler-attributes-for-linus-v5.0-rc3' and... · 315a6d85
      Linus Torvalds authored
      Merge tags 'compiler-attributes-for-linus-v5.0-rc3' and 'clang-format-for-linus-v5.0-rc3' of git://github.com/ojeda/linux
      
      Pull misc clang fixes from Miguel Ojeda:
      
        - A fix for OPTIMIZER_HIDE_VAR from Michael S Tsirkin
      
        - Update clang-format with the latest for_each macro list from Jason
          Gunthorpe
      
      * tag 'compiler-attributes-for-linus-v5.0-rc3' of git://github.com/ojeda/linux:
        include/linux/compiler*.h: fix OPTIMIZER_HIDE_VAR
      
      * tag 'clang-format-for-linus-v5.0-rc3' of git://github.com/ojeda/linux:
        clang-format: Update .clang-format with the latest for_each macro list
      315a6d85
    • fix int_sqrt64() for very large numbers · fbfaf851
      Florian La Roche authored
      If an input number x for int_sqrt64() has the highest bit set, then
      fls64(x) is 64.  (1UL << 64) is an overflow and breaks the algorithm.
      
      Subtracting 1 is a better guess for the initial value of m anyway and
      that's what also done in int_sqrt() implicitly [*].
      
      [*] Note how int_sqrt() uses __fls() with two underscores, which already
          returns the proper raw bit number.
      
          In contrast, int_sqrt64() used fls64(), and that returns bit numbers
          illogically starting at 1, because of error handling for the "no
          bits set" case. Will points out that he bug probably is due to a
          copy-and-paste error from the regular int_sqrt() case.
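      The fix is essentially a one-line change to the initial guess m, roughly
      of this shape (sketch of lib/int_sqrt.c; the surrounding loop is
      unchanged):

      	-	m = 1ULL << (fls64(x) & ~1ULL);
      	+	m = 1ULL << ((fls64(x) - 1) & ~1ULL);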
      Signed-off-by: Florian La Roche <Florian.LaRoche@googlemail.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbfaf851
    • x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · 6e693b3f
      Will Deacon authored
      Commit 594cc251 ("make 'user_access_begin()' do 'access_ok()'")
      makes the access_ok() check part of the user_access_begin() preceding a
      series of 'unsafe' accesses.  This has the desirable effect of ensuring
      that all 'unsafe' accesses have been range-checked, without having to
      pick through all of the callsites to verify whether the appropriate
      checking has been made.
      
      However, the consolidated range check does not inhibit speculation, so
      it is still up to the caller to ensure that they are not susceptible to
      any speculative side-channel attacks for user addresses that ultimately
      fail the access_ok() check.
      
      This is an oversight, so use __uaccess_begin_nospec() to ensure that
      speculation is inhibited until the access_ok() check has passed.
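      On x86 the resulting helper looks roughly like this (sketch; the key
      point is the _nospec variant right after the access_ok() check):

      	static __must_check __always_inline bool
      	user_access_begin(const void __user *ptr, size_t len)
      	{
      		if (unlikely(!access_ok(ptr, len)))
      			return 0;
      		__uaccess_begin_nospec();
      		return 1;
      	}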
      Reported-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e693b3f
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · b0f3e768
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Three arm64 fixes for -rc3.
      
        We've plugged a couple of nasty issues involving KASLR-enabled
        kernels, and removed a redundant #define that was introduced as part
        of the KHWASAN fixes from akpm at -rc2.
      
         - Fix broken kpti page-table rewrite in bizarre KASLR configuration
      
         - Fix module loading with KASLR
      
         - Remove redundant definition of ARCH_SLAB_MINALIGN"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        kasan, arm64: remove redundant ARCH_SLAB_MINALIGN define
        arm64: kaslr: ensure randomized quantities are clean to the PoC
        arm64: kpti: Update arm64_kernel_use_ng_mappings() when forced on
      b0f3e768
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6436408e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2019-01-20
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix an out-of-bounds access in __bpf_redirect_no_mac, from Willem.
      
      2) Fix bpf_setsockopt to reset sock dst on SO_MARK changes, from Peter.
      
      3) Fix map in map masking to prevent out-of-bounds access under
         speculative execution, from Daniel.
      
      4) Fix bpf_setsockopt's SO_MAX_PACING_RATE to support TCP internal
         pacing, from Yuchung.
      
      5) Fix json writer license in bpftool, from Thomas.
      
      6) Fix AF_XDP to check whether a queue actually exists during umem
         setup, from Krzysztof.
      
      7) Several fixes to BPF stackmap's build id handling. Another fix
         for bpftool build to account for libbfd variations wrt linking
         requirements, from Stanislav.
      
      8) Fix BPF samples build with clang by working around missing asm
         goto, from Yonghong.
      
      9) Fix libbpf to retry program load on signal interrupt, from Lorenz.
      
      10) Various minor compile warning fixes in BPF code, from Mathieu.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6436408e
    • bpf: in __bpf_redirect_no_mac pull mac only if present · e7c87bd6
      Willem de Bruijn authored
      Syzkaller was able to construct a packet of negative length by
      redirecting from bpf_prog_test_run_skb with BPF_PROG_TYPE_LWT_XMIT:
      
          BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:345 [inline]
          BUG: KASAN: slab-out-of-bounds in skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
          BUG: KASAN: slab-out-of-bounds in __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
          Read of size 4294967282 at addr ffff8801d798009c by task syz-executor2/12942
      
          kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
          check_memory_region_inline mm/kasan/kasan.c:260 [inline]
          check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
          memcpy+0x23/0x50 mm/kasan/kasan.c:302
          memcpy include/linux/string.h:345 [inline]
          skb_copy_from_linear_data include/linux/skbuff.h:3421 [inline]
          __pskb_copy_fclone+0x2dd/0xeb0 net/core/skbuff.c:1395
          __pskb_copy include/linux/skbuff.h:1053 [inline]
          pskb_copy include/linux/skbuff.h:2904 [inline]
          skb_realloc_headroom+0xe7/0x120 net/core/skbuff.c:1539
          ipip6_tunnel_xmit net/ipv6/sit.c:965 [inline]
          sit_tunnel_xmit+0xe1b/0x30d0 net/ipv6/sit.c:1029
          __netdev_start_xmit include/linux/netdevice.h:4325 [inline]
          netdev_start_xmit include/linux/netdevice.h:4334 [inline]
          xmit_one net/core/dev.c:3219 [inline]
          dev_hard_start_xmit+0x295/0xc90 net/core/dev.c:3235
          __dev_queue_xmit+0x2f0d/0x3950 net/core/dev.c:3805
          dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
          __bpf_tx_skb net/core/filter.c:2016 [inline]
          __bpf_redirect_common net/core/filter.c:2054 [inline]
          __bpf_redirect+0x5cf/0xb20 net/core/filter.c:2061
          ____bpf_clone_redirect net/core/filter.c:2094 [inline]
          bpf_clone_redirect+0x2f6/0x490 net/core/filter.c:2066
          bpf_prog_41f2bcae09cd4ac3+0xb25/0x1000
      
      The generated test constructs a packet with mac header, network
      header, skb->data pointing to network header and skb->len 0.
      
      Redirecting to a sit0 through __bpf_redirect_no_mac pulls the
      mac length, even though skb->data already is at skb->network_header.
      bpf_prog_test_run_skb has already pulled it as LWT_XMIT !is_l2.
      
      Update the offset calculation to pull only if skb->data differs
      from skb->network_header, which is not true in this case.
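      Roughly, the pull becomes conditional on the actual network header offset
      (sketch; details may differ from the final patch):

      	/* in __bpf_redirect_no_mac(): */
      	unsigned int mlen = skb_network_offset(skb);

      	if (mlen) {
      		__skb_pull(skb, mlen);
      		skb_postpull_rcsum(skb, skb_mac_header(skb), mlen);
      	}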
      
      The test itself can be run only from commit 1cf1cae9 ("bpf:
      introduce BPF_PROG_TEST_RUN command"), but the same type of packets
      with skb at network header could already be built from lwt xmit hooks,
      so this fix is more relevant to that commit.
      
      Also set the mac header on redirect from LWT_XMIT, as even after this
      change to __bpf_redirect_no_mac that field is expected to be set, but
      is not yet in ip_finish_output2.
      
      Fixes: 3a0af8fd ("bpf: BPF for lightweight tunnel infrastructure")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e7c87bd6
    • virtio_net: bulk free tx skbs · df133f3f
      Michael S. Tsirkin authored
      Use napi_consume_skb() to get bulk free.  Note that napi_consume_skb is
      safe to call in a non-napi context as long as the napi_budget flag is
      correct.
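      Concretely, the skb free path is told whether it runs in NAPI context and
      passes that flag through, e.g. (sketch, parameter name assumed):

      	static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
      	{
      		/* ... for each completed tx skb ... */
      		napi_consume_skb(skb, in_napi);
      	}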
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df133f3f
  6. 19 Jan, 2019 4 commits
    • Merge tag 'mips_fixes_5.0_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 5d5c303e
      Linus Torvalds authored
      Pull MIPS fixes from Paul Burton:
      
       - Fix IPI handling for Lantiq SoCs, which was broken by changes made
         back in v4.12.
      
       - Enable OF/DT serial support in ath79_defconfig to give us working
         serial by default.
      
       - Fix 64b builds for the Jazz platform.
      
       - Set up a struct device for the BCM47xx SoC to allow BCM47xx drivers
         to perform DMA again following the major DMA mapping changes made in
         v4.19.
      
       - Disable MSI on Cavium Octeon systems when the pcie_disable command
         line parameter introduced in v3.3 is used, in order to avoid
         inadvertently accessing PCIe controller registers despite the command
         line.
      
       - Fix a build failure for Cavium Octeon kernels with kexec enabled,
         introduced in v4.20.
      
       - Fix a regression in the behaviour of semctl/shmctl/msgctl IPC
         syscalls for kernels including n32 support but not o32 support caused
         by some cleanup in v3.19.
      
      * tag 'mips_fixes_5.0_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: OCTEON: fix kexec support
        mips: fix n32 compat_ipc_parse_version
        Disable MSI also when pcie-octeon.pcie_disable on
        MIPS: BCM47XX: Setup struct device for the SoC
        MIPS: jazz: fix 64bit build
        MIPS: ath79: Enable OF serial ports in the default config
        MIPS: lantiq: Use CP0_LEGACY_COMPARE_IRQ
        MIPS: lantiq: Fix IPI interrupt handling
      5d5c303e
    • Merge tag 'devicetree-fixes-for-5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 6a0141a0
      Linus Torvalds authored
      Pull Devicetree fix from Rob Herring:
       "A single build fix for powerpc due to device_node.type removal"
      
      * tag 'devicetree-fixes-for-5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        powerpc: chrp: Use of_node_is_type to access device_type
      6a0141a0
    • Merge tag 'libnvdimm-fixes-5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 26caabbc
      Linus Torvalds authored
      Pull libnvdimm fixes from Dan Williams:
       "A crash fix, a build warning fix, a miscellaneous small cleanups.
      
        In case anyone is looking for them, there was a regression caught by
        testing that caused two patches to be dropped from this update.  Those
        patches have been reworked and will soak for another week / re-target
        5.0-rc4.
      
         - Fix driver initialization crash due to the inability to report an
           'error' state for a DIMM's security capability.
      
         - Build warning fix for little-endian ARM64 builds
      
         - Fix a potential race between the EDAC driver's usage of the NFIT
           SMBIOS id for a DIMM and the driver shutdown path.
      
         - A small collection of one-line benign cleanups for duplicate
           variable assignments, a duplicate header include and a mis-typed
           function argument"
      
      * tag 'libnvdimm-fixes-5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm/security: Fix nvdimm_security_state() state request selection
        acpi/nfit: Remove duplicate set nd_set in acpi_nfit_init_interleave_set()
        acpi/nfit: Fix race accessing memdev in nfit_get_smbios_id()
        libnvdimm/dimm: Fix security capability detection for non-Intel NVDIMMs
        nfit: Mark some functions as __maybe_unused
        ACPI/nfit: delete the function to_acpi_nfit_desc
        ACPI/nfit: delete the redundant header file
      26caabbc
    • Merge tag 'linux-watchdog-5.0-rc-fixes' of git://www.linux-watchdog.org/linux-watchdog · f403d718
      Linus Torvalds authored
      Pull watchdog fixes from Wim Van Sebroeck:
      
       - mt7621_wdt/rt2880_wdt: Fix compilation problem
      
       - tqmx86: Fix a couple IS_ERR() vs NULL bugs
      
      * tag 'linux-watchdog-5.0-rc-fixes' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: tqmx86: Fix a couple IS_ERR() vs NULL bugs
        watchdog: mt7621_wdt/rt2880_wdt: Fix compilation problem
      f403d718