1. 29 Sep, 2017 22 commits
    • sched/fair: Propagate an effective runnable_load_avg · 1ea6c46a
      Peter Zijlstra authored
      The load balancer uses runnable_load_avg as load indicator. For
      !cgroup this is:
      
        runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
      
      That is, a direct sum of all runnable tasks on that runqueue. As
      opposed to load_avg, which is a sum of all tasks on the runqueue,
      which includes a blocked component.
      
      However, in the cgroup case, this comes apart since the group entities
      are always runnable, even if most of their constituent entities are
      blocked.
      
      Therefore introduce a runnable_weight which for task entities is the
      same as the regular weight, but for group entities is a fraction of
      the entity weight and represents the runnable part of the group
      runqueue.
      
      Then propagate this load through the PELT hierarchy to arrive at an
      effective runnable load average -- which we should not confuse with
      the canonical runnable load average.
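
      As a rough illustration of the !cgroup definition above, the following
      user-space C sketch sums load_avg twice: once over on_rq entities only,
      and once over all entities. The names toy_se, toy_runnable_load_avg()
      and toy_load_avg() are hypothetical and are not the kernel's data
      structures or helpers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a scheduling entity; not the kernel's struct. */
struct toy_se {
	unsigned long load_avg; /* models se->avg.load_avg */
	int on_rq;              /* non-zero while the entity is runnable */
};

/* Sum of load_avg over runnable (on_rq) entities only. */
unsigned long toy_runnable_load_avg(const struct toy_se *se, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		if (se[i].on_rq)
			sum += se[i].load_avg;
	return sum;
}

/* Sum of load_avg over all entities, including the blocked component. */
unsigned long toy_load_avg(const struct toy_se *se, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += se[i].load_avg;
	return sum;
}
```

      In this model a group entity that always has on_rq set would contribute
      its full load_avg to the first sum even when most of its children are
      blocked, which is the case the runnable_weight is meant to address.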
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite PELT migration propagation · 0e2d2aaa
      Peter Zijlstra authored
      When an entity migrates in (or out) of a runqueue, we need to add (or
      remove) its contribution from the entire PELT hierarchy, because even
      non-runnable entities are included in the load average sums.
      
      In order to do this we have some propagation logic that updates the
      PELT tree, however the way it 'propagates' the runnable (or load)
      change is (more or less):
      
                           tg->weight * grq->avg.load_avg
        ge->avg.load_avg = ------------------------------
                                     tg->load_avg
      
      But that is the expression for ge->weight, and per the definition of
      load_avg:
      
        ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
      
      That destroys the runnable_avg (by setting it to 1) we wanted to
      propagate.
      
      Instead directly propagate runnable_sum.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite cfs_rq->removed_*avg · 2a2f5d4e
      Peter Zijlstra authored
      Since on wakeup migration we don't hold the rq->lock for the old CPU
      we cannot update its state. Instead we add the removed 'load' to an
      atomic variable and have the next update on that CPU collect and
      process it.
      
      Currently we have 2 atomic variables; which already have the issue
      that they can be read out-of-sync. Also, two atomic ops on a single
      cacheline is already more expensive than an uncontended lock.
      
      Since we want to add more, convert the thing over to an explicit
      cacheline with a lock in.
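
      A minimal user-space sketch of the idea, assuming a single lock that
      guards all of the "removed" fields so they are always read in sync. The
      names toy_removed, toy_remove() and toy_collect() are hypothetical, and
      a C11 atomic_flag spin loop merely stands in for the kernel's own lock
      type and cacheline alignment.

```c
#include <stdatomic.h>

/* Hypothetical model: one lock guards every field below. */
struct toy_removed {
	atomic_flag lock;
	int nr;                 /* number of pending removals */
	unsigned long load_avg; /* accumulated load to subtract */
	unsigned long util_avg; /* accumulated utilization to subtract */
};

static void toy_lock(struct toy_removed *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_removed *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Migrating side: queue the contribution that has to be removed. */
void toy_remove(struct toy_removed *r, unsigned long load, unsigned long util)
{
	toy_lock(r);
	r->nr++;
	r->load_avg += load;
	r->util_avg += util;
	toy_unlock(r);
}

/* Next update on the old CPU: take a consistent snapshot and reset it. */
int toy_collect(struct toy_removed *r, unsigned long *load, unsigned long *util)
{
	int nr;

	toy_lock(r);
	nr = r->nr;
	*load = r->load_avg;
	*util = r->util_avg;
	r->nr = 0;
	r->load_avg = 0;
	r->util_avg = 0;
	toy_unlock(r);
	return nr;
}
```

      Adding a further field in this model only means touching it inside the
      same critical section, rather than adding another atomic operation.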
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Use reweight_entity() for set_user_nice() · 9059393e
      Vincent Guittot authored
      Now that we directly change load_avg and propagate that change into
      the sums, sys_nice() and co should do the same, otherwise it's possible
      to confuse load accounting when we migrate near the weight change.
      Fixes-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added changelog, fixed the call condition. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: More accurate reweight_entity() · 840c5abc
      Peter Zijlstra authored
      When a (group) entity changes its weight we should instantly change
      its load_avg and propagate that change into the sums it is part of.
      Because we use these values to predict future behaviour and are not
      interested in its historical value.
      
      Without this change, the change in load would need to propagate
      through the average, by which time it could again have changed, etc.,
      always chasing itself.
      
      With this change, the cfs_rq load_avg sum will more accurately reflect
      the current runnable and expected return of blocked load.
      Reported-by: Paul Turner <pjt@google.com>
      [josef: compile fix !SMP || !FAIR_GROUP]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Introduce {en,de}queue_load_avg() · 8d5b9025
      Peter Zijlstra authored
      Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
      for {en,de}queue_load_avg(). More users will follow.
      
      Includes some code movement to avoid fwd declarations.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rename {en,de}queue_entity_load_avg() · b5b3e35f
      Peter Zijlstra authored
      Since they're now purely about runnable_load, rename them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Move enqueue migrate handling · b382a531
      Peter Zijlstra authored
      Move the entity migrate handling from enqueue_entity_load_avg() to
      update_load_avg(). This has two benefits:
      
       - {en,de}queue_entity_load_avg() will become purely about managing
         runnable_load
      
       - we can avoid a double update_tg_load_avg() and reduce pressure on
         the global tg->shares cacheline
      
      The reason we do this is so that we can change update_cfs_shares() to
      change both weight and (future) runnable_weight. For this to work we
      need to have the cfs_rq averages up-to-date (which means having done
      the attach), but we need the cfs_rq->avg.runnable_avg to not yet
      include the se's contribution (since se->on_rq == 0).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Change update_load_avg() arguments · 88c0616e
      Peter Zijlstra authored
      Most call sites of update_load_avg() already have cfs_rq_of(se)
      available, pass it down instead of recomputing it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove se->load.weight from se->avg.load_sum · c7b50216
      Peter Zijlstra authored
      Remove the load from the load_sum for sched_entities, basically
      turning load_sum into runnable_sum.  This prepares for better
      reweighting of group entities.
      
      Since we now have different rules for computing load_avg, split
      ___update_load_avg() into two parts, ___update_load_sum() and
      ___update_load_avg().
      
      So for se:
      
        ___update_load_sum(.weight = 1)
        ___update_load_avg(.weight = se->load.weight)
      
      and for cfs_rq:
      
        ___update_load_sum(.weight = cfs_rq->load.weight)
        ___update_load_avg(.weight = 1)
      
      Since the primary consumable is load_avg, most things will not be
      affected. Only those few sites that initialize/modify load_sum need
      attention.
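
      The split described above can be sketched in user-space C as two small
      helpers. This is a simplified model under stated assumptions: decay is
      omitted, the divider is taken to be a fixed constant (47742, assumed
      here as the series maximum), and the names toy_update_load_sum() and
      toy_update_load_avg() are hypothetical.

```c
#include <stdint.h>

/* Assumed fixed divider; the real divider is computed differently. */
#define TOY_LOAD_AVG_MAX 47742ULL

/* Sum step: accumulate `contrib` time units scaled by `weight` (no decay). */
uint64_t toy_update_load_sum(uint64_t load_sum, uint64_t weight, uint64_t contrib)
{
	return load_sum + weight * contrib;
}

/* Average step: scale the sum by `weight` and normalize by the divider. */
uint64_t toy_update_load_avg(uint64_t load_sum, uint64_t weight)
{
	return weight * load_sum / TOY_LOAD_AVG_MAX;
}
```

      For an se the weight is applied in the average step (sum with weight 1,
      average with the entity weight), while for a cfs_rq it is applied in the
      sum step (sum with the runqueue weight, average with weight 1). In this
      model both orders give the same load_avg for a single entity, but only
      the se form keeps load_sum free of the weight.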
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Cure calc_cfs_shares() vs. reweight_entity() · 3d4b60d3
      Peter Zijlstra authored
      Vincent reported that when running in a cgroup, his root
      cfs_rq->avg.load_avg dropped to 0 on task idle.
      
      This is because reweight_entity() will now immediately propagate the
      weight change of the group entity to its cfs_rq, and as it happens,
      our approximation (5) for calc_cfs_shares() results in 0 when the group
      is idle.
      
      Avoid this by using the correct (3) as a lower bound on (5). This way
      the empty cgroup will slowly decay instead of instantly drop to 0.
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add comment to calc_cfs_shares() · cef27403
      Peter Zijlstra authored
      Explain the magic equation in calc_cfs_shares() a bit better.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Clean up calc_cfs_shares() · 7c80cfc9
      Peter Zijlstra authored
      For consistency's sake, we should have only a single reading of tg->shares.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Ethan Zhao authored
      The system will hang if the user sets sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      This is because a divide-by-zero error happens in scale_rt_capacity():
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, check user input value of sysctl_sched_time_avg, keep
      it unchanged when hitting invalid input, and set the minimum limit of
      sysctl_sched_time_avg to 1 ms.
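
      The validation described above can be sketched in user-space C as
      follows. The setter name toy_set_sched_time_avg_ms() and its signature
      are hypothetical and do not reflect how the kernel's sysctl handler is
      wired up; only the behaviour (reject values below 1 ms, keep the old
      value on invalid input) comes from the text.

```c
#include <errno.h>

/* Lower bound taken from the text above: 1 ms. */
#define TOY_SCHED_TIME_AVG_MIN_MS 1

/*
 * Hypothetical setter: rejects values below the minimum and leaves the
 * current setting untouched, so a period derived from it is never zero.
 */
int toy_set_sched_time_avg_ms(unsigned int *cur_ms, int requested_ms)
{
	if (requested_ms < TOY_SCHED_TIME_AVG_MIN_MS)
		return -EINVAL;
	*cur_ms = (unsigned int)requested_ms;
	return 0;
}
```

      With this check a write of 0 fails and the previous value stays in
      effect, so the later division always sees a non-zero period.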
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Add explicit TASK_PARKED printing · 8ef9925b
      Peter Zijlstra authored
      Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
      its own print state because it will not in fact get woken by regular
      wakeups and is a long-term state.
      
      This requires moving TASK_PARKED into the TASK_REPORT mask, and since
      the latter needs to be a contiguous bitmask, we need to shuffle the
      bits around a bit.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Ignore TASK_IDLE for SysRq-W · 5d68cc95
      Peter Zijlstra authored
      Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
      which results in undesirable clutter.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Add explicit TASK_IDLE printing · 06eb6184
      Peter Zijlstra authored
      Markus reported that kthreads that idle using TASK_IDLE instead of
      TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE and things
      like htop mark those red.
      
      This is undesirable, so add an explicit state for TASK_IDLE.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/tracing: Use common task-state helpers · 5f6ad26e
      Peter Zijlstra authored
      Remove yet another task-state char instance.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/tracing: Fix trace_sched_switch task-state printing · efb40f58
      Peter Zijlstra authored
      Convert trace_sched_switch to use the common task-state helpers and
      fix the "X" and "Z" order, possibly they ended up in the wrong order
      because TASK_REPORT has them in the wrong order too.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Remove unused variable · 65d5dc47
      Peter Zijlstra authored
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Convert TASK_state to hex · 92c4bc9f
      Peter Zijlstra authored
      Bit patterns are easier in hex.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Implement consistent task-state printing · 1593baab
      Peter Zijlstra authored
      Currently get_task_state() and task_state_to_char() report different
      states, create a number of common helpers and unify the reported state
      space.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 28 Sep, 2017 10 commits
    • Merge tag 'acpi-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 770b782f
      Linus Torvalds authored
      Pull ACPI fix from Rafael Wysocki:
       "This fixes an APEI problem that may cause a reported error to be
        missed due to a race condition"
      
      * tag 'acpi-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI / APEI: clear error status before acknowledging the error
    • Merge tag 'pm-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 74de8187
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix a deadlock in the operating performance points (OPP)
        framework introduced during the 4.11 cycle, more issues with duplicate
        device objects for cpufreq-dt and cpufreq documentation.
      
        Specifics:
      
         - Fix a deadlock in the operating performance points (OPP) framework
           caused by a notifier callback taking a lock that's already held by
           its caller (Viresh Kumar).
      
         - Prevent the ti-cpufreq and cpufreq-dt-platdev drivers from
           attempting to register conflicting device objects which triggers a
           warning from sysfs (Suniel Mahesh).
      
         - Drop a stale reference to a piece of intel_pstate documentation
           that's not in the tree any more (Rafael Wysocki)"
      
      * tag 'pm-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: docs: Drop intel-pstate.txt from index.txt
        cpufreq: dt: Fix sysfs duplicate filename creation for platform-device
        PM / OPP: Call notifier without holding opp_table->lock
    • Merge tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 02a2b053
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
      
       - fix various problems with the copy-on-write extent maps getting freed
         at the wrong time
      
       - fix printk format specifier problems
      
       - report zeroing operation outcomes instead of dropping them on the
         floor
      
       - fix some crashes when dio operations partially fail
      
       - fix a race condition between unwritten extent conversion & dio read
      
       - fix some incorrect tests in the inode log item processing
      
       - correct the delayed allocation space reservations on rmap filesystems
      
       - fix some problems checking for dax support
      
      * tag 'xfs-4.14-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: revert "xfs: factor rmap btree size into the indlen calculations"
        xfs: Capture state of the right inode in xfs_iflush_done
        xfs: perag initialization should only touch m_ag_max_usable for AG 0
        xfs: update i_size after unwritten conversion in dio completion
        iomap_dio_rw: Allocate AIO completion queue before submitting dio
        xfs: validate bdev support for DAX inode flag
        xfs: remove redundant re-initialization of total_nr_pages
        xfs: Output warning message when discard option was enabled even though the device does not support discard
        xfs: report zeroed or not correctly in xfs_zero_range()
        xfs: kill meaningless variable 'zero'
        fs/xfs: Use %pS printk format for direct addresses
        xfs: evict CoW fork extents when performing finsert/fcollapse
        xfs: don't unconditionally clear the reflink flag on zero-block files
    • Revert "Bluetooth: Add option for disabling legacy ioctl interfaces" · e49aa15e
      Linus Torvalds authored
      This reverts commit dbbccdc4.
      
      It turns out that the "legacy" users aren't so legacy at all, and that
      turning off the legacy ioctl will break the current Qt bluetooth stack
      for bluetooth LE devices that were released just a couple of months ago.
      
      So it's simply not true that this was a legacy interface that hasn't
      been needed and is only limited to old legacy BT devices.  Because I
      actually read Kconfig help messages, and actively try to turn off
      features that I don't need, I turned the option off.
      
      Then I spent _way_ too much time debugging BLE issues until I realized
      that it wasn't the Qt and subsurface development that had broken one of
      my dive computer BLE downloads, but simply my broken kernel config.
      
      Maybe in a decade it will be true that this is a legacy interface.  And
      maybe with a better help-text and correct dependencies, this kind of
      legacy removal might be acceptable.  But as things are right now both
      the commit message and the Kconfig help text were misleading, and the
      Kconfig option had the wrong dependencies.
      
      There's no reason to keep that broken Kconfig option in the tree.
      
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Cc: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'acpi-apei' · 333d1774
      Rafael J. Wysocki authored
      * acpi-apei:
        ACPI / APEI: clear error status before acknowledging the error
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · 91735832
      Linus Torvalds authored
      Pull rdma fixes from Doug Ledford:
       "Second -rc update for 4.14.
      
        Both Mellanox and Intel had a series of -rc fixes that landed this
        week. The Mellanox bunch is spread throughout the stack and not just
        in their driver, whereas the Intel bunch was mostly in the hfi1
        driver. And, several of the fixes in the hfi1 driver were more than
        just simple 5 line fixes. As a result, the hfi1 driver fixes have a
        sizable LOC count.
      
        Everything else is as one would expect in an RC cycle in terms of LOC
        count. One item that might jump out and make you think "That's not an
        rc item" is the fix that corrects a typo. But, that change fixes a
        typo in a user visible API that was just added in this merge window,
        so if we fix it now, we can fix it. If we don't, the typo is in the
        API forever. Another that might not appear to be a fix at first glance
        is the Simplify mlx5_ib_cont_pages patch, but the simplification
        allows them to fix a bug in the existing function whenever the length
        of an SGE exceeded page size. We also had to revert one patch from the
        merge window that was wrong.
      
        Summary:
      
         - a few core fixes
         - a few ipoib fixes
         - a few mlx5 fixes
         - a 7-patch hfi1 related series"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
        IB/hfi1: Unsuccessful PCIe caps tuning should not fail driver load
        IB/hfi1: On error, fix use after free during user context setup
        Revert "IB/ipoib: Update broadcast object if PKey value was changed in index 0"
        IB/hfi1: Return correct value in general interrupt handler
        IB/hfi1: Check eeprom config partition validity
        IB/hfi1: Only reset QSFP after link up and turn off AOC TX
        IB/hfi1: Turn off AOC TX after offline substates
        IB/mlx5: Fix NULL deference on mlx5_ib_update_xlt failure
        IB/mlx5: Simplify mlx5_ib_cont_pages
        IB/ipoib: Fix inconsistency with free_netdev and free_rdma_netdev
        IB/ipoib: Fix sysfs Pkey create<->remove possible deadlock
        IB: Correct MR length field to be 64-bit
        IB/core: Fix qp_sec use after free access
        IB/core: Fix typo in the name of the tag-matching cap struct
    • Merge branches 'pm-opp' and 'pm-cpufreq' · abeb19a2
      Rafael J. Wysocki authored
      * pm-opp:
        PM / OPP: Call notifier without holding opp_table->lock
      
      * pm-cpufreq:
        cpufreq: docs: Drop intel-pstate.txt from index.txt
        cpufreq: dt: Fix sysfs duplicate filename creation for platform-device
    • Merge tag 'seccomp-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 26e811cd
      Linus Torvalds authored
      Pull seccomp fix from Kees Cook:
       "Fix refcounting bug in CRIU interface, noticed by Chris Salls (Oleg &
        Tycho)"
      
      * tag 'seccomp-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()
    • seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter() · 66a733ea
      Oleg Nesterov authored
      As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end
      up using different filters. Once we drop ->siglock it is possible for
      task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC.
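
      The general pattern behind such a fix can be sketched in user-space C:
      pin the object through a local snapshot and later drop the reference on
      that same snapshot, never by re-reading the task's pointer. The types
      and helpers below (toy_filter, toy_task, toy_get_filter(),
      toy_put_filter()) are hypothetical, and a plain int stands in for the
      kernel's reference counter.

```c
#include <stddef.h>

/* Hypothetical refcounted filter; not the kernel's struct seccomp_filter. */
struct toy_filter {
	int usage;
};

struct toy_task {
	struct toy_filter *filter; /* may be swapped once the lock is dropped */
};

/*
 * Take the reference on a local snapshot. The caller is assumed to hold
 * the lock that protects task->filter while this runs.
 */
struct toy_filter *toy_get_filter(struct toy_task *task)
{
	struct toy_filter *f = task->filter;

	if (f)
		f->usage++;
	return f;
}

/* Drop the reference on the object that was pinned, not on task->filter. */
void toy_put_filter(struct toy_filter *f)
{
	if (f)
		f->usage--;
}
```

      If the put side instead looked at task->filter again after the lock was
      released, it could decrement a different filter than the one that was
      incremented, leaving one count too high and the other too low.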
      
      Fixes: f8e529ed ("seccomp, ptrace: add support for dumping seccomp filters")
      Reported-by: Chris Salls <chrissalls5@gmail.com>
      Cc: stable@vger.kernel.org # needs s/refcount_/atomic_/ for v4.12 and earlier
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [tycho: add __get_seccomp_filter vs. open coding refcount_inc()]
      Signed-off-by: Tycho Andersen <tycho@docker.com>
      [kees: tweak commit log]
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • cpufreq: docs: Drop intel-pstate.txt from index.txt · 8aba2333
      Rafael J. Wysocki authored
      Commit 33fc30b4 (cpufreq: intel_pstate: Document the current
      behavior and user interface) dropped the intel-pstate.txt file
      from Documentation/cpu-freq/, but it did not update the index.txt
      file in there accordingly, so do that now.
      
      Fixes: 33fc30b4 (cpufreq: intel_pstate: Document the current behavior and user interface)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  3. 27 Sep, 2017 8 commits
    • ACPI / APEI: clear error status before acknowledging the error · aaf2c2fb
      Tyler Baicar authored
      Currently we acknowledge errors before clearing the error status.
      This could cause a new error to be populated by firmware in-between
      the error acknowledgment and the error status clearing which would
      cause the second error's status to be cleared without being handled.
      So, clear the error status before acknowledging the errors.
      
      Also, make sure to acknowledge the error if the error status read
      fails.
      Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 9cd6681c
      Linus Torvalds authored
      Pull quota and isofs fixes from Jan Kara:
       "Two quota fixes (fallout of the quota locking changes) and an isofs
        build fix"
      
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        quota: Fix quota corruption with generic/232 test
        isofs: fix build regression
        quota: add missing lock into __dquot_transfer()
      9cd6681c
    • Merge tag 'linux-kselftest-4.14-rc3-fixes' of... · 225d3b67
      Linus Torvalds authored
      Merge tag 'linux-kselftest-4.14-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest fixes from Shuah Khan:
       "This update consists of:
      
         - fixes to several existing tests
      
         - a test for regression introduced by b9470c27 ("inet: kill
           smallest_size and smallest_port")
      
         - seccomp support for glibc 2.26 siginfo_t.h
      
         - fixes to kselftest framework and tests to run make O=dir use-case
      
         - fixes to silence unnecessary test output to de-clutter test results"
      
      * tag 'linux-kselftest-4.14-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (28 commits)
        selftests: timers: set-timer-lat: Fix hang when testing unsupported alarms
        selftests: timers: set-timer-lat: fix hang when std out/err are redirected
        selftests/memfd: correct run_tests.sh permission
        selftests/seccomp: Support glibc 2.26 siginfo_t.h
        selftests: futex: Makefile: fix for loops in targets to run silently
        selftests: Makefile: fix for loops in targets to run silently
        selftests: mqueue: Use full path to run tests from Makefile
        selftests: futex: copy sub-dir test scripts for make O=dir run
        selftests: lib.mk: copy test scripts and test files for make O=dir run
        selftests: sync: kselftest and kselftest-clean fail for make O=dir case
        selftests: sync: use TEST_CUSTOM_PROGS instead of TEST_PROGS
        selftests: lib.mk: add TEST_CUSTOM_PROGS to allow custom test run/install
        selftests: watchdog: fix to use TEST_GEN_PROGS and remove clean
        selftests: lib.mk: fix test executable status check to use full path
        selftests: Makefile: clear LDFLAGS for make O=dir use-case
        selftests: lib.mk: kselftest and kselftest-clean fail for make O=dir case
        Makefile: kselftest and kselftest-clean fail for make O=dir case
        selftests/net: msg_zerocopy enable build with older kernel headers
        selftests: actually run the various net selftests
        selftest: add a reuseaddr test
        ...
      225d3b67
    • Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7031b641
      Linus Torvalds authored
      Pull x86 fpu fixes and cleanups from Ingo Molnar:
       "This is _way_ more cleanups than fixes, but the bugs were subtle and
        hard to hit, and the primary reason for them existing was the
        unnecessary historical complexity of some of the x86/fpu interfaces.
      
        The first bunch of commits clean up and simplify the xstate user copy
        handling functions, in reaction to the collective head-scratching
        about the xstate user-copy handling code that leads up to the fix for
        this SkyLake xstate handling bug:
      
           0852b374: x86/fpu: Add FPU state copying quirk to handle XRSTOR failure on Intel Skylake CPUs
      
        The cleanups don't change any functionality, they just (hopefully)
        make it all clearer, more consistent, more debuggable and more robust.
      
        Note that most of the linecount increase comes from these commits,
        where we better split the user/kernel copy logic by having more
        variants, instead of repeated fragile patterns of:
      
                     if (kbuf) {
                             memcpy(kbuf + pos, data, copy);
                     } else {
                             if (__copy_to_user(ubuf + pos, data, copy))
                                     return -EFAULT;
                     }
      
        The next bunch of commits simplify the FPU state-machine to get rid of
        old lazy-FPU idiosyncrasies - a defensive simplification to make all
        the code easier to review and fix. No change in functionality.
      
        Then there's a couple of additional debugging tweaks: a static checker
        warning fix and moving an FPU related warning under WARN_ON_FPU(),
        followed by another bunch of commits that represent a finegrained
        split-up of the fixes from Eric Biggers to handle weird xstate bits
        properly.
      
        I did this finegrained split-up because some of these fixes also
        impact the ABI for weird xstate handling, for which we'd like to have
        good bisection results, should they cause any problems. (We also had
        one regression with the more monolithic fixes, so splitting it all up
        sounded prudent for robustness reasons as well.)
      
        About the whole series: the commits up to 03eaec81 have been in
        -next for months - but I've recently rebased them to remove a state
        machine clean-up commit that was objected to, and to make it more
        bisectable - so technically it's a new, rebased tree.
      
        Robustness history: this series had some regressions along the way,
        and all reported regressions have been fixed. All but one of the
        regressions manifested themselves as easy-to-report warnings. The
        previous version of this latest series was also in linux-next, with
        one (warning-only) regression reported which is fixed in the latest
        version.
      
        Barring last minute brown paper bag bugs (and the commits are now
        older by a day which I'd hope helps paperbag reduction), I'm
        reasonably confident about its general robustness.
      
        Famous last words ..."
      
      * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
        x86/fpu: Use using_compacted_format() instead of open coded X86_FEATURE_XSAVES
        x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_user_to_xstate()
        x86/fpu: Eliminate the 'xfeatures' local variable in copy_user_to_xstate()
        x86/fpu: Copy the full header in copy_user_to_xstate()
        x86/fpu: Use validate_xstate_header() to validate the xstate_header in copy_kernel_to_xstate()
        x86/fpu: Eliminate the 'xfeatures' local variable in copy_kernel_to_xstate()
        x86/fpu: Copy the full state_header in copy_kernel_to_xstate()
        x86/fpu: Use validate_xstate_header() to validate the xstate_header in __fpu__restore_sig()
        x86/fpu: Use validate_xstate_header() to validate the xstate_header in xstateregs_set()
        x86/fpu: Introduce validate_xstate_header()
        x86/fpu: Rename fpu__activate_fpstate_read/write() to fpu__prepare_[read|write]()
        x86/fpu: Rename fpu__activate_curr() to fpu__initialize()
        x86/fpu: Simplify and speed up fpu__copy()
        x86/fpu: Fix stale comments about lazy FPU logic
        x86/fpu: Rename fpu::fpstate_active to fpu::initialized
        x86/fpu: Remove fpu__current_fpstate_write_begin/end()
        x86/fpu: Fix fpu__activate_fpstate_read() and update comments
        x86/fpu: Reinitialize FPU registers if restoring FPU state fails
        x86/fpu: Don't let userspace set bogus xcomp_bv
        x86/fpu: Turn WARN_ON() in context switch into WARN_ON_FPU()
        ...
      7031b641
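The pull message above contrasts one function that branches on `kbuf` at every copy with separate kernel and user variants. A minimal user-space sketch of that split follows. All names here are hypothetical, and `fake_copy_to_user()` is only a stand-in that mimics the "returns the number of bytes not copied" convention; nothing below is the actual x86/fpu code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for __copy_to_user(): returns the number of bytes
 * that could NOT be copied. A NULL destination simulates a faulting address.
 */
static unsigned long fake_copy_to_user(void *ubuf, const void *src,
                                       unsigned long n)
{
    if (!ubuf)
        return n;
    memcpy(ubuf, src, n);
    return 0;
}

/* Kernel-buffer variant: a plain memcpy() that cannot fault. */
static int copy_part_to_kernel(void *kbuf, size_t pos, const void *data,
                               size_t n)
{
    memcpy((char *)kbuf + pos, data, n);
    return 0;
}

/* User-buffer variant: the only place that has to deal with -EFAULT. */
static int copy_part_to_user(void *ubuf, size_t pos, const void *data,
                             size_t n)
{
    void *dst = ubuf ? (char *)ubuf + pos : NULL;

    if (fake_copy_to_user(dst, data, n))
        return -EFAULT;
    return 0;
}
```

With the split, each caller picks the variant that matches its buffer type once, instead of repeating the `if (kbuf) ... else ...` branch at every copy site.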
    • IB/hfi1: Unsuccessful PCIe caps tuning should not fail driver load · 828bcbdc
      Harish Chegondi authored
      Failure to tune PCIe capabilities should not fail driver load. This can
      cause the driver load to fail on systems with any of the following:
      1. HFI's parent is not root. Example: HFI card is behind a PCIe bridge.
      2. HFI's parent is not PCI Express capable.
      In these situations, failure to tune PCIe capabilities should be logged
      in the system message logs but not cause the driver load to fail.
      
      This patch also ensures pcie capability word DevCtl is written only
      after a successful read and the capability tuning process continues
      even if read/write of the pcie capability word DevCtl fails.
      
      Fixes: c53df62c ("IB/hfi1: Check return values from PCI config API calls")
      Fixes: bf70a775 ("staging/rdma/hfi1: Enable WFR PCIe extended tags from the driver")
      Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
      Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      828bcbdc
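The behavior described in this commit (best-effort tuning, DevCtl written only after a successful read, load never failed by tuning) can be sketched in user space as follows. The device model, the helper names and the `FAKE_EXT_TAG` bit are all hypothetical and exist only for this illustration; they are not taken from the hfi1 driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define FAKE_EXT_TAG 0x100u  /* hypothetical "extended tags" bit */

/* Hypothetical device model; none of these fields come from hfi1. */
struct fake_dev {
    bool parent_is_root;
    bool read_fails;
    unsigned int devctl;
    int devctl_writes;
    int warnings;
};

static int fake_read_devctl(struct fake_dev *d, unsigned int *val)
{
    if (d->read_fails)
        return -1;
    *val = d->devctl;
    return 0;
}

static int fake_write_devctl(struct fake_dev *d, unsigned int val)
{
    d->devctl = val;
    d->devctl_writes++;
    return 0;
}

static void log_warning(struct fake_dev *d, const char *msg)
{
    d->warnings++;
    fprintf(stderr, "pcie tuning: %s\n", msg);
}

/* Best-effort tuning: returns void, so the caller cannot fail because of it. */
static void tune_pcie_caps(struct fake_dev *d)
{
    unsigned int ectl;

    /* Write DevCtl only if the read succeeded; otherwise log and move on. */
    if (fake_read_devctl(d, &ectl) != 0)
        log_warning(d, "unable to read DevCtl");
    else if (!(ectl & FAKE_EXT_TAG) &&
             fake_write_devctl(d, ectl | FAKE_EXT_TAG) != 0)
        log_warning(d, "unable to write DevCtl");

    if (!d->parent_is_root) {
        log_warning(d, "parent is not root, skipping remaining tuning");
        return;
    }
    /* Remaining tuning steps are omitted from this sketch. */
}

static int load_driver(struct fake_dev *d)
{
    tune_pcie_caps(d);  /* failures are logged, never propagated */
    return 0;
}
```

The design choice is that tuning failures are reported through the log only, so a card behind a bridge still loads with default PCIe settings.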
    • IB/hfi1: On error, fix use after free during user context setup · b8f42738
      Michael J. Ruhl authored
      During base context setup, if setup_base_ctxt() fails, the context is
      deallocated. This is incorrect because the context is referenced on
      return, to notify any waiting subcontext.  If there are no subcontexts
      the pointer will be invalid.
      
      Reorganize the error path so that deallocate_ctxt() is called after all
      the possible subcontexts have been notified.
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      b8f42738
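The reordered error path can be sketched in user space like this. It is a minimal illustration with hypothetical names (`fake_ctxt`, `fake_setup_base()`, `notify_waiters()`, `open_ctxt()`), not the hfi1 code: the point is only that the last dereference of the context happens before it is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context object; not the hfi1 structure. */
struct fake_ctxt {
    int subctxt_cnt;
    int notified;
};

static int notified_total;  /* bookkeeping for the sketch only */

/* Stand-in for base context setup; fails when asked to. */
static int fake_setup_base(struct fake_ctxt *c, int fail)
{
    (void)c;
    return fail ? -1 : 0;
}

/* Wake any waiting subcontexts; this dereferences the context. */
static void notify_waiters(struct fake_ctxt *c)
{
    c->notified = c->subctxt_cnt;
    notified_total += c->notified;
}

/* On success the caller owns *out; on failure *out is NULL. */
static int open_ctxt(int subctxt_cnt, int fail, struct fake_ctxt **out)
{
    struct fake_ctxt *c = calloc(1, sizeof(*c));
    int ret;

    *out = NULL;
    if (!c)
        return -1;
    c->subctxt_cnt = subctxt_cnt;

    ret = fake_setup_base(c, fail);

    /* Waiters are notified on success and failure, while c is still valid. */
    notify_waiters(c);

    if (ret) {
        free(c);  /* deallocate only after the last use of c */
        return ret;
    }
    *out = c;
    return 0;
}
```

Freeing inside the failing setup step and then calling `notify_waiters()` would be the use after free that the commit describes.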
    • Revert "IB/ipoib: Update broadcast object if PKey value was changed in index 0" · 612601d0
      Alex Estrin authored
      Commit 9a9b8112 causes the core to fail to destroy the UD QP on ipoib
      unload, and therefore leaks resources.
      On a pkey change event, that patch modifies the mgid before calling the
      underlying driver to detach it from the QP. The driver's detach_mcast()
      will fail to find the modified mgid, because it was never given that
      mgid to attach in the first place.
      The core's qp->usecnt will never go down, so ib_destroy_qp() will fail.
      
      The IPoIB driver actually does take care of the new broadcast mgid based
      on the new pkey by destroying the old mcast object in
      ipoib_mcast_dev_flush():
      ....
      	if (priv->broadcast) {
      		rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree);
      		list_add_tail(&priv->broadcast->list, &remove_list);
      		priv->broadcast = NULL;
      	}
      ...
      
      and then, in the restarted ipoib_mcast_join_task(), creating a new
      broadcast mcast object, sending a join request and, on completion,
      telling the driver to attach to the reinitialized QP:
      ...
      if (!priv->broadcast) {
      ...
      	broadcast = ipoib_mcast_alloc(dev, 0);
      ...
      	memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
      	       sizeof (union ib_gid));
      	priv->broadcast = broadcast;
      ...
      
      Fixes: 9a9b8112 ("IB/ipoib: Update broadcast object if PKey value was changed in index 0")
      Cc: stable@vger.kernel.org
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Alex Estrin <alex.estrin@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Feras Daoud <ferasda@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      612601d0
    • IB/hfi1: Return correct value in general interrupt handler · 09592af5
      Kamenee Arumugam authored
      The general interrupt handler returns IRQ_HANDLED whether an IRQ
      was handled or not.
      Determine if an IRQ was handled and return the correct value.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      09592af5
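A handler that reports whether it actually handled anything can be sketched as below. This is a user-space illustration only: `fake_irqreturn` mirrors the idea of the kernel's IRQ_NONE / IRQ_HANDLED return values, and the register array and `handled_count` counter are assumptions made for the example, not the hfi1 handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the kernel's IRQ_NONE / IRQ_HANDLED values. */
enum fake_irqreturn {
    FAKE_IRQ_NONE = 0,
    FAKE_IRQ_HANDLED = 1
};

static int handled_count;  /* number of interrupt sources serviced */

/*
 * Walk the pending-interrupt registers and service every set bit.
 * Return FAKE_IRQ_HANDLED only if at least one source was serviced.
 */
static enum fake_irqreturn general_interrupt(const unsigned long *regs,
                                             size_t nregs)
{
    enum fake_irqreturn ret = FAKE_IRQ_NONE;
    size_t i;

    for (i = 0; i < nregs; i++) {
        unsigned long pending = regs[i];

        while (pending) {
            pending &= pending - 1;  /* clear the lowest set bit */
            handled_count++;
            ret = FAKE_IRQ_HANDLED;
        }
    }
    return ret;
}
```

Returning the "none" value when no bit was set lets the caller tell a spurious or shared interrupt apart from one this handler serviced.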