  1. 05 May, 2017 1 commit
    • cpufreq: schedutil: use now as reference when aggregating shared policy requests · d86ab9cf
      Juri Lelli authored
      Currently, sugov_next_freq_shared() uses last_freq_update_time as a
      reference to decide when to start considering CPU contributions as
      stale.
      
      However, since last_freq_update_time is set by the last CPU that issued
      a frequency transition, this might cause problems in certain cases. In
      practice, the detection of stale utilization values fails whenever the
      CPU with such values was the last to update the policy. For example
      (note again that the SCHED_CPUFREQ_RT flag itself is not the problem
      here; the problem is only detecting how long after being set that
      flag has to be considered stale), suppose a policy with 2 CPUs:
      
                     CPU0                |               CPU1
                                         |
                                         |     RT task scheduled
                                         |     SCHED_CPUFREQ_RT is set
                                         |     CPU1->last_update = now
                                         |     freq transition to max
                                         |     last_freq_update_time = now
                                         |
      
                              more than TICK_NSEC nsecs
      
                                         |
           a small CFS task wakes up     |
           CPU0->last_update = now1      |
           delta_ns(CPU0) < TICK_NSEC*   |
           CPU0's util is considered     |
           delta_ns(CPU1) =              |
            last_freq_update_time -      |
            CPU1->last_update = 0        |
            < TICK_NSEC                  |
           CPU1 is still considered      |
           CPU1->SCHED_CPUFREQ_RT is set |
           we stay at max (until CPU1    |
           exits from idle)              |
      
      * delta_ns is actually negative as now1 > last_freq_update_time
      
      While last_freq_update_time is a sensible reference for rate limiting,
      it doesn't seem to be useful for working around stale CPU states.
      
      Fix the problem by always considering now (time) as the reference for
      deciding when CPUs have stale contributions.
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
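The two-CPU scenario in the commit message above can be reproduced with a small userspace sketch. It is not the kernel code: the struct, the function names and the 4 ms value chosen for TICK_NSEC are assumptions made for illustration, and the `rt` field only stands in for the SCHED_CPUFREQ_RT flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 2
#define TICK_NSEC 4000000LL /* assumed tick length: 4 ms */

/* Hypothetical per-CPU record, loosely modelled on the commit text. */
struct cpu_state {
	uint64_t last_update; /* time of this CPU's last update, in ns */
	bool rt;              /* stands in for SCHED_CPUFREQ_RT */
};

/* A contribution is stale if it is more than one tick older than "ref". */
bool contribution_is_stale(const struct cpu_state *c, uint64_t ref)
{
	int64_t delta_ns = (int64_t)ref - (int64_t)c->last_update;

	return delta_ns > TICK_NSEC;
}

/* Returns true if any non-stale CPU still asks for the maximum frequency. */
bool policy_wants_max(const struct cpu_state cpus[NCPUS], uint64_t ref)
{
	for (int i = 0; i < NCPUS; i++) {
		if (contribution_is_stale(&cpus[i], ref))
			continue;
		if (cpus[i].rt)
			return true;
	}
	return false;
}
```

With CPU1 last updated at the same instant as last_freq_update_time and CPU0 updated more than one tick later, passing last_freq_update_time as `ref` keeps CPU1's request alive (its delta is 0), while passing the current time as `ref` lets it expire.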
  2. 17 Apr, 2017 1 commit
  3. 16 Apr, 2017 1 commit
  4. 13 Apr, 2017 1 commit
  5. 29 Mar, 2017 1 commit
  6. 28 Mar, 2017 16 commits
  7. 25 Mar, 2017 1 commit
  8. 24 Mar, 2017 6 commits
    • cpufreq: intel_pstate: Avoid transient updates of cpuinfo.max_freq · 80b120ca
      Rafael J. Wysocki authored
      Both intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
      set policy->cpuinfo.max_freq depending on the turbo status, but the
      updates made by them are discarded by the core, because the policy
      object passed to them by the core is temporary and cpuinfo.max_freq
      from that object is not copied to the final policy object in
      cpufreq_set_policy().
      
      However, cpufreq_set_policy() passes the temporary policy object
      to the ->setpolicy callback of the driver, so intel_pstate_set_policy()
      actually sees the policy->cpuinfo.max_freq value updated by
      intel_pstate_verify_policy() and not the final one.  It also
      sometimes updates policy->max, which basically has no effect after
      it returns, because the core discards that update.
      
      To avoid confusion, eliminate policy->cpuinfo.max_freq updates from
      intel_pstate_verify_policy() and intel_cpufreq_verify_policy()
      entirely and check the maximum frequency explicitly in
      intel_pstate_update_perf_limits() instead of relying on the
      transiently updated policy->cpuinfo.max_freq value.
      
      Moreover, move the policy->max adjustment carried out in
      intel_pstate_set_policy() to a separate function and call that
      function from the ->verify driver callbacks to ensure that it will
      actually be effective.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: intel_pstate: Active mode P-state limits rework · c5a2ee7d
      Rafael J. Wysocki authored
      The coordination of P-state limits used by intel_pstate in the active
      mode (i.e. by default) is problematic, because it synchronizes all of
      the limits (i.e. the global ones and the per-policy ones) so as to use
      one common pair of P-state limits (min and max) across all CPUs in
      the system.  The drawbacks of that are as follows:
      
       - If P-states are coordinated in hardware, it is not necessary
         to coordinate them in software on top of that, so in that case
         all of the above activity is in vain.
      
       - If P-states are not coordinated in hardware, then the processor
         is actually capable of setting different P-states for different
         CPUs and coordinating them at the software level simply doesn't
         allow that capability to be utilized.
      
       - The coordination works in such a way that setting a per-policy
         limit (e.g. scaling_max_freq) for one CPU causes the common
         effective limit to change (and it will affect all of the other
         CPUs too), but subsequent reads from the corresponding sysfs
         attributes for the other CPUs will return stale values (which
         is confusing).
      
       - Reads from the global P-state limit attributes, min_perf_pct and
         max_perf_pct, return the effective common values and not the last
         values set through these attributes.  However, the last values
         set through these attributes become hard limits that cannot be
         exceeded by writes to scaling_min_freq and scaling_max_freq,
         respectively, and they are not exposed, so essentially users
         have to remember what they are.
      
      All of that is painful enough to warrant a change of the management
      of P-state limits in the active mode.
      
      To that end, redesign the active mode P-state limits management in
      intel_pstate in accordance with the following rules:
      
       (1) All CPUs are affected by the global limits (that is, none of
           them can be requested to run faster than the global max and
           none of them can be requested to run slower than the global
           min).
      
       (2) Each individual CPU is affected by its own per-policy limits
           (that is, it cannot be requested to run faster than its own
           per-policy max and it cannot be requested to run slower than
           its own per-policy min).
      
       (3) The global and per-policy limits can be set independently.
      
      Also, the global maximum and minimum P-state limits will always be
      expressed as percentages of the maximum supported turbo P-state.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
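Rules (1) to (3) above amount to clamping each request against two independent ranges. The following is a minimal sketch of that idea, not the intel_pstate implementation: the struct and function names are hypothetical, all limits are expressed in percent for simplicity (the commit only says so for the global ones), and letting the upper bound win when the two ranges conflict is an assumption.

```c
/* Hypothetical limit pair, in percent of the maximum turbo P-state. */
struct perf_limits {
	int min_pct;
	int max_pct;
};

/*
 * Clamp a requested performance level so that it respects both the
 * global limits (rule 1) and the CPU's own per-policy limits (rule 2).
 * The two limit sets are passed in separately and never modify each
 * other (rule 3).
 */
int effective_request(int req_pct, struct perf_limits global,
		      struct perf_limits policy)
{
	int max = global.max_pct < policy.max_pct ? global.max_pct : policy.max_pct;
	int min = global.min_pct > policy.min_pct ? global.min_pct : policy.min_pct;

	/* Assumption: if the ranges conflict, the upper bound wins. */
	if (min > max)
		min = max;

	if (req_pct < min)
		return min;
	if (req_pct > max)
		return max;
	return req_pct;
}
```

Because each CPU is clamped against its own per-policy pair, tightening the limits for one CPU leaves the result for every other CPU unchanged, which is the property the old common-limits scheme lacked.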
    • cpufreq: intel_pstate: Use load-based P-state selection more widely · 55395345
      Rafael J. Wysocki authored
      Extend the set of systems for which intel_pstate will use the
      "powersave" P-state selection algorithm based on CPU load in the
      active mode by systems with ACPI preferred profile set to "tablet",
      "appliance PC", "desktop", or "workstation" (i.e. everything with a
      specified preferred profile that is not a "server").
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: intel_pstate: Support HWP processors in all operation modes · eb5139d1
      Rafael J. Wysocki authored
      Currently, some processors supporting HWP are only supported by
      intel_pstate if HWP is actually going to be used and are not supported
      otherwise, which is confusing.
      
      Specifically, they are not supported if "intel_pstate=no_hwp" is
      passed to the kernel in the command line or if the driver is started
      in the passive mode ("intel_pstate=passive").
      
      There is no real reason for that, because everything about those
      processors is known anyway and the driver can work with them in all
      modes, so make that happen, but use the load-based P-state selection
      algorithm for the active mode "powersave" policy with them.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • f1a91645
    • cpufreq: schedutil: Trace frequency only if it has changed · 38d4ea22
      Rafael J. Wysocki authored
      sugov_update_commit() calls trace_cpu_frequency() to record the
      current CPU frequency if it has not changed in the fast switch case
      to prevent utilities from getting confused (they may report that the
      CPU is idle if the frequency has not been recorded for too long, for
      example).
      
      However, that may cause the tracepoint to be triggered quite often
      for no real reason (if the frequency doesn't change, we will not
      modify the last update time stamp and governor computations may
      run again shortly when that happens), so don't do that (arguably, it
      is done to work around a utilities bug anyway).
      
      That allows code duplication in sugov_update_commit() to be reduced
      somewhat too.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
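The behaviour this commit describes (no trace event and no time stamp update when the frequency is unchanged) can be sketched as follows. This is a userspace illustration under stated assumptions, not sugov_update_commit() itself: the struct, its fields and the event counter that stands in for trace_cpu_frequency() are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-policy state for the sketch. */
struct policy_state {
	unsigned int next_freq;         /* last committed frequency, kHz */
	uint64_t last_freq_update_time; /* time of the last real change, ns */
	unsigned int trace_events;      /* stands in for trace_cpu_frequency() */
};

/*
 * Commit a new frequency. If it equals the one already set, return
 * early: the time stamp stays untouched and nothing is traced.
 */
bool commit_freq(struct policy_state *p, uint64_t time, unsigned int next_freq)
{
	if (p->next_freq == next_freq)
		return false;

	p->next_freq = next_freq;
	p->last_freq_update_time = time;
	p->trace_events++;
	return true;
}
```

A single early return covers both the "skip the update" and the "skip the trace" cases, which gives a rough idea of why the commit could also reduce code duplication.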
  9. 23 Mar, 2017 1 commit
    • cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely · b7eaf1aa
      Rafael J. Wysocki authored
      The way the schedutil governor uses the PELT metric causes it to
      underestimate the CPU utilization in some cases.
      
      That can be easily demonstrated by running kernel compilation on
      a Sandy Bridge Intel processor, running turbostat in parallel with
      it and looking at the values written to the MSR_IA32_PERF_CTL
      register.  Namely, the expected result would be that when all CPUs
      were 100% busy, all of them would be requested to run in the maximum
      P-state, but observation shows that this clearly isn't the case.
      The CPUs run in the maximum P-state for a while and then are
      requested to run slower and go back to the maximum P-state after
      a while again.  That causes the actual frequency of the processor to
      visibly oscillate below the sustainable maximum in a jittery fashion
      which clearly is not desirable.
      
      That has been attributed to CPU utilization metric updates on task
      migration that cause the total utilization value for the CPU to be
      reduced by the utilization of the migrated task.  If that happens,
      the schedutil governor may see a CPU utilization reduction and will
      attempt to reduce the CPU frequency accordingly right away.  That
      may be premature, though, for example if the system is generally
      busy and there are other runnable tasks waiting to be run on that
      CPU already.
      
      This is unlikely to be an issue on systems where cpufreq policies are
      shared between multiple CPUs, because in those cases the policy
      utilization is computed as the maximum of the CPU utilization values
      over the whole policy and if that turns out to be low, reducing the
      frequency for the policy most likely is a good idea anyway.  On
      systems with one CPU per policy, however, it may affect performance
      adversely and even lead to increased energy consumption in some cases.
      
      On those systems it may be addressed by taking another utilization
      metric into consideration, like whether or not the CPU whose
      frequency is about to be reduced has been idle recently, because if
      that's not the case, the CPU is likely to be busy in the near future
      and its frequency should not be reduced.
      
      To that end, use the counter of idle calls in the timekeeping code.
      Namely, make the schedutil governor look at that counter for the
      current CPU every time before its frequency is about to be reduced.
      If the counter has not changed since the previous iteration of the
      governor computations for that CPU, the CPU has been busy for all
      that time and its frequency should not be decreased, so if the new
      frequency would be lower than the one set previously, the governor
      will skip the frequency update.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joel Fernandes <joelaf@google.com>
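The idle-calls heuristic from the last paragraph of this commit message can be sketched in userspace as below. The names are hypothetical and the idle-call counter is passed in as a plain argument, whereas the commit says the governor reads it from the timekeeping code; treat this as an illustration of the rule, not the governor's code.

```c
#include <stdbool.h>

/* Hypothetical per-CPU governor data. */
struct gov_cpu {
	unsigned long saved_idle_calls; /* counter value seen last time */
	unsigned int cur_freq;          /* frequency set previously, kHz */
};

/* Busy means: the idle-call counter has not moved since the last check. */
bool cpu_is_busy(struct gov_cpu *c, unsigned long idle_calls)
{
	bool busy = idle_calls == c->saved_idle_calls;

	c->saved_idle_calls = idle_calls;
	return busy;
}

/*
 * Pick the frequency to apply: a reduction is skipped while the CPU
 * is busy, increases and reductions on a recently idle CPU go through.
 */
unsigned int pick_freq(struct gov_cpu *c, unsigned long idle_calls,
		       unsigned int next_freq)
{
	bool busy = cpu_is_busy(c, idle_calls);

	if (busy && next_freq < c->cur_freq)
		return c->cur_freq;

	c->cur_freq = next_freq;
	return next_freq;
}
```

Note that the check only ever blocks reductions, so a busy CPU can still be moved to a higher frequency immediately.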
  10. 21 Mar, 2017 2 commits
    • cpufreq: intel_pstate: Fix policy data management in passive mode · 64897b20
      Rafael J. Wysocki authored
      The policy->cpuinfo.max_freq and policy->max updates in
      intel_cpufreq_turbo_update() are excessive as they are done for no
      good reason and may lead to problems in principle, so they should be
      dropped.  However, after dropping them intel_cpufreq_turbo_update()
      becomes almost entirely pointless, because the check made by it is
      made again down the road in intel_pstate_prepare_request().  The
      only thing in it that still needs to be done is the call to
      update_turbo_state(), so drop intel_cpufreq_turbo_update() altogether
      and make its callers invoke update_turbo_state() directly instead of
      it.
      
      In addition to that, fix intel_cpufreq_verify_policy() so that it
      checks global.no_turbo in addition to global.turbo_disabled when
      updating policy->cpuinfo.max_freq to make it consistent with
      intel_pstate_verify_policy().
      
      Fixes: 001c76f0 (cpufreq: intel_pstate: Generic governors support)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start() · 4296f23e
      Rafael J. Wysocki authored
      sugov_start() only initializes struct sugov_cpu per-CPU structures
      for shared policies, but it should do that for single-CPU policies too.
      
      That in particular makes the IO-wait boost mechanism work in the
      cases when cpufreq policies correspond to individual CPUs.
      
      Fixes: 21ca6d2c (cpufreq: schedutil: Add iowait boosting)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: 4.9+ <stable@vger.kernel.org> # 4.9+
  11. 20 Mar, 2017 5 commits
    • Linux 4.11-rc3 · 97da3854
      Linus Torvalds authored
    • mm/swap: don't BUG_ON() due to uninitialized swap slot cache · 452b94b8
      Linus Torvalds authored
      This BUG_ON() triggered for me once at shutdown, and I don't see a
      reason for the check.  The code correctly checks whether the swap slot
      cache is usable or not, so an uninitialized swap slot cache is not
      actually problematic afaik.
      
      I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
      I'm not sure why that seemingly pointless check was there.  I suspect
      the real fix is to just remove it entirely, but for now we'll warn about
      it but not bring the machine down.
      
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
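The practical difference between the two checks mentioned in this commit is that a BUG_ON() stops the machine, while a WARN_ON_ONCE() reports the condition a single time and lets execution continue. The warn-once behaviour can be imitated in userspace as sketched below; the struct and the helper are hypothetical stand-ins, not the kernel macro.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical bookkeeping for one warning site. */
struct warn_site {
	bool warned;          /* set after the first report */
	unsigned int reports; /* number of messages actually printed */
};

/*
 * Report "what" the first time cond is true, stay silent afterwards,
 * and always hand the condition back so the caller can carry on.
 */
bool warn_on_once(struct warn_site *site, bool cond, const char *what)
{
	if (cond && !site->warned) {
		site->warned = true;
		site->reports++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}
```

The caller keeps running in every case, which matches the stated intent of warning about the condition without bringing the machine down.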
    • Merge tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a07a6e41
      Linus Torvalds authored
      Pull more powerpc fixes from Michael Ellerman:
       "A couple of minor powerpc fixes for 4.11:
      
         - wire up statx() syscall
      
         - don't print a warning on memory hotplug when HPT resizing isn't
           available
      
        Thanks to: David Gibson, Chandan Rajendra"
      
      * tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Don't give a warning when HPT resizing isn't available
        powerpc: Wire up statx() syscall
    • Merge branch 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 4571bc5a
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
      
       - Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
         modules with CONFIG_MODVERSIONS.
      
       - Dave Anglin optimized the cache flushing for vmap ranges.
      
       - Arvind Yadav provided a fix for a potential NULL pointer dereference
         in the parisc perf code (and some code cleanups).
      
       - I wired up the new statx system call, fixed some compiler warnings
         with the access_ok() macro and fixed shutdown code to really halt a
         system at shutdown instead of crashing & rebooting.
      
      * 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Fix system shutdown halt
        parisc: perf: Fix potential NULL pointer dereference
        parisc: Avoid compiler warnings with access_ok()
        parisc: Wire up statx system call
        parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
        parisc: support R_PARISC_SECREL32 relocation in modules
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 8aa34172
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "The bulk of the changes are in qla2xxx target driver code to address
        various issues found during Cavium/QLogic's internal testing (stable
        CC's included), along with a few other stability and smaller
        miscellaneous improvements.
      
        There are also a couple of different patch sets from Mike Christie,
        which have been a result of his work to use target-core ALUA logic
        together with tcm-user backend driver.
      
        Finally, a patch to address some long standing issues with
        pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
        which will make folks using physical (or virtual) magnetic tape happy"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
        qla2xxx: Update driver version to 9.00.00.00-k
        qla2xxx: Fix delayed response to command for loop mode/direct connect.
        qla2xxx: Change scsi host lookup method.
        qla2xxx: Add DebugFS node to display Port Database
        qla2xxx: Use IOCB interface to submit non-critical MBX.
        qla2xxx: Add async new target notification
        qla2xxx: Export DIF stats via debugfs
        qla2xxx: Improve T10-DIF/PI handling in driver.
        qla2xxx: Allow relogin to proceed if remote login did not finish
        qla2xxx: Fix sess_lock & hardware_lock lock order problem.
        qla2xxx: Fix inadequate lock protection for ABTS.
        qla2xxx: Fix request queue corruption.
        qla2xxx: Fix memory leak for abts processing
        qla2xxx: Allow vref count to timeout on vport delete.
        tcmu: Convert cmd_time_out into backend device attribute
        tcmu: make cmd timeout configurable
        tcmu: add helper to check if dev was configured
        target: fix race during implicit transition work flushes
        target: allow userspace to set state to transitioning
        target: fix ALUA transition timeout handling
        ...
  12. 19 Mar, 2017 4 commits