Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki: "These are fixes mostly, for a few changes made in this cycle (the intel_idle driver, the OPP library, the ACPI EC driver, turbostat) and for some issues that have just been discovered (ACPI PCI IRQ management, PCI power management documentation, turbostat), with a couple of cleanups on top of them. Specifics: - intel_idle driver fixup for the recently added Skylake chips support (Len Brown). - Operating Performance Points (OPP) library fix related to the recently added support for new DT bindings and a fix for a typo in a comment (Viresh Kumar, Stephen Boyd). - ACPI EC driver fix for a recently introduced memory leak in an error code path (Lv Zheng). - ACPI PCI IRQ management fix for the issue where an ISA IRQ is shared with a PCI device which requires it to be configured in a different way and may cause an interrupt storm to happen as a result with an extra ACPI SCI IRQ handling simplification on top of it (Jiang Liu). - Update of the PCI power management documentation that became outdated and started to actively confuse the readers to make it actually reflect the code (Rafael J Wysocki). - turbostat fixes including an IVB Xeon regression fix (related to the --debug command line option), Skylake adjustment for the TSC running at a frequency that doesn't match the base one exactly, and a Knights Landing quirk to account for the fact that it only updates APERF and MPERF every 1024 clock cycles plus bumping up the turbostat version number (Len Brown, Hubert Chrzaniuk)" * tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: tools/power turbosat: update version number tools/power turbostat: SKL: Adjust for TSC difference from base frequency tools/power turbostat: KNL workaround for %Busy and Avg_MHz tools/power turbostat: IVB Xeon: fix --debug regression ACPI / PCI: Remove duplicated penalty on SCI IRQ ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ ACPI / EC: Fix a memory leak issue in acpi_ec_query() PM / OPP: Fix typo modifcation -> modification PCI / PM: Update runtime PM documentation for PCI devices PM / OPP: of_property_count_u32_elems() can return errors intel_idle: Skylake Client Support - updated

Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki: "These are fixes mostly, for a few changes made in this cycle (the intel_idle driver, the OPP library, the ACPI EC driver, turbostat) and for some issues that have just been discovered (ACPI PCI IRQ management, PCI power management documentation, turbostat), with a couple of cleanups on top of them. Specifics: - intel_idle driver fixup for the recently added Skylake chips support (Len Brown). - Operating Performance Points (OPP) library fix related to the recently added support for new DT bindings and a fix for a typo in a comment (Viresh Kumar, Stephen Boyd). - ACPI EC driver fix for a recently introduced memory leak in an error code path (Lv Zheng). - ACPI PCI IRQ management fix for the issue where an ISA IRQ is shared with a PCI device which requires it to be configured in a different way and may cause an interrupt storm to happen as a result with an extra ACPI SCI IRQ handling simplification on top of it (Jiang Liu). - Update of the PCI power management documentation that became outdated and started to actively confuse the readers to make it actually reflect the code (Rafael J Wysocki). - turbostat fixes including an IVB Xeon regression fix (related to the --debug command line option), Skylake adjustment for the TSC running at a frequency that doesn't match the base one exactly, and a Knights Landing quirk to account for the fact that it only updates APERF and MPERF every 1024 clock cycles plus bumping up the turbostat version number (Len Brown, Hubert Chrzaniuk)" * tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: tools/power turbosat: update version number tools/power turbostat: SKL: Adjust for TSC difference from base frequency tools/power turbostat: KNL workaround for %Busy and Avg_MHz tools/power turbostat: IVB Xeon: fix --debug regression ACPI / PCI: Remove duplicated penalty on SCI IRQ ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ ACPI / EC: Fix a memory leak issue in acpi_ec_query() PM / OPP: Fix typo modifcation -> modification PCI / PM: Update runtime PM documentation for PCI devices PM / OPP: of_property_count_u32_elems() can return errors intel_idle: Skylake Client Support - updated
1bca1000 · Linus Torvalds · 3deaa4f5 · eb6d1c28 · 1bca1000 · 1bca1000
Commit 1bca1000 authored Oct 01, 2015 by Linus Torvalds
9 changed files
--- a/Documentation/power/pci.txt
+++ b/Documentation/power/pci.txt
@@ -979,20 +979,45 @@ every time right after the runtime_resume() callback has returned
 (alternatively, the runtime_suspend() callback will have to check if the
 device should really be suspended and return -EAGAIN if that is not the case).

-The runtime PM of PCI devices is disabled by default.  It is also blocked by
-pci_pm_init() that runs the pm_runtime_forbid() helper function.  If a PCI
-driver implements the runtime PM callbacks and intends to use the runtime PM
-framework provided by the PM core and the PCI subsystem, it should enable this
-feature by executing the pm_runtime_enable() helper function.  However, the
-driver should not call the pm_runtime_allow() helper function unblocking
-the runtime PM of the device.  Instead, it should allow user space or some
-platform-specific code to do that (user space can do it via sysfs), although
-once it has called pm_runtime_enable(), it must be prepared to handle the
+The runtime PM of PCI devices is enabled by default by the PCI core.  PCI
+device drivers do not need to enable it and should not attempt to do so.
+However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid()
+helper function.  In addition to that, the runtime PM usage counter of
+each PCI device is incremented by local_pci_probe() before executing the
+probe callback provided by the device's driver.
+
+If a PCI driver implements the runtime PM callbacks and intends to use the
+runtime PM framework provided by the PM core and the PCI subsystem, it needs
+to decrement the device's runtime PM usage counter in its probe callback
+function.  If it doesn't do that, the counter will always be different from
+zero for the device and it will never be runtime-suspended.  The simplest
+way to do that is by calling pm_runtime_put_noidle(), but if the driver
+wants to schedule an autosuspend right away, for example, it may call
+pm_runtime_put_autosuspend() instead for this purpose.  Generally, it
+just needs to call a function that decrements the devices usage counter
+from its probe routine to make runtime PM work for the device.
+
+It is important to remember that the driver's runtime_suspend() callback
+may be executed right after the usage counter has been decremented, because
+user space may already have cuased the pm_runtime_allow() helper function
+unblocking the runtime PM of the device to run via sysfs, so the driver must
+be prepared to cope with that.
+
+The driver itself should not call pm_runtime_allow(), though.  Instead, it
+should let user space or some platform-specific code do that (user space can
+do it via sysfs as stated above), but it must be prepared to handle the
 runtime PM of the device correctly as soon as pm_runtime_allow() is called
-(which may happen at any time).  [It also is possible that user space causes
-pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
-fact the driver has to be prepared to handle the runtime PM of the device as
-soon as it calls pm_runtime_enable().]
+(which may happen at any time, even before the driver is loaded).
+
+When the driver's remove callback runs, it has to balance the decrementation
+of the device's runtime PM usage counter at the probe time.  For this reason,
+if it has decremented the counter in its probe callback, it must run
+pm_runtime_get_noresume() in its remove callback.  [Since the core carries
+out a runtime resume of the device and bumps up the device's usage counter
+before running the driver's remove callback, the runtime PM of the device
+is effectively disabled for the duration of the remove execution and all
+runtime PM helper functions incrementing the device's usage counter are
+then effectively equivalent to pm_runtime_get_noresume().]

 The runtime PM framework works by processing requests to suspend or resume
 devices, or to check if they are idle (in which cases it is reasonable to

--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -1044,8 +1044,10 @@ static int acpi_ec_query(struct acpi_ec *ec, u8 *data)
 		goto err_exit;

 	mutex_lock(&ec->mutex);
+	result = -ENODATA;
 	list_for_each_entry(handler, &ec->list, node) {
 		if (value == handler->query_bit) {
+			result = 0;
 			q->handler = acpi_ec_get_query_handler(handler);
 			ec_dbg_evt("Query(0x%02x) scheduled",
 				   q->handler->query_bit);

--- a/drivers/acpi/pci_irq.c
+++ b/drivers/acpi/pci_irq.c
@@ -372,6 +372,7 @@ static int acpi_isa_register_gsi(struct pci_dev *dev)

 	/* Interrupt Line values above 0xF are forbidden */
 	if (dev->irq > 0 && (dev->irq <= 0xF) &&
+	    acpi_isa_irq_available(dev->irq) &&
 	    (acpi_isa_irq_to_gsi(dev->irq, &dev_gsi) == 0)) {
 		dev_warn(&dev->dev, "PCI INT %c: no GSI - using ISA IRQ %d\n",
 			 pin_name(dev->pin), dev->irq);

--- a/drivers/acpi/pci_link.c
+++ b/drivers/acpi/pci_link.c
@@ -498,8 +498,7 @@ int __init acpi_irq_penalty_init(void)
 			    PIRQ_PENALTY_PCI_POSSIBLE;
 		}
 	}
-	/* Add a penalty for the SCI */
-	acpi_irq_penalty[acpi_gbl_FADT.sci_interrupt] += PIRQ_PENALTY_PCI_USING;
+
 	return 0;
 }

@@ -553,6 +552,13 @@ static int acpi_pci_link_allocate(struct acpi_pci_link *link)
 				irq = link->irq.possible[i];
 		}
 	}
+	if (acpi_irq_penalty[irq] >= PIRQ_PENALTY_ISA_ALWAYS) {
+		printk(KERN_ERR PREFIX "No IRQ available for %s [%s]. "
+			    "Try pci=noacpi or acpi=off\n",
+			    acpi_device_name(link->device),
+			    acpi_device_bid(link->device));
+		return -ENODEV;
+	}

 	/* Attempt to enable the link device at this IRQ. */
 	if (acpi_pci_link_set(link, irq)) {
@@ -821,6 +827,12 @@ void acpi_penalize_isa_irq(int irq, int active)
 	}
 }

+bool acpi_isa_irq_available(int irq)
+{
+	return irq >= 0 && (irq >= ARRAY_SIZE(acpi_irq_penalty) ||
+			    acpi_irq_penalty[irq] < PIRQ_PENALTY_ISA_ALWAYS);
+}
+
 /*
 * Penalize IRQ used by ACPI SCI. If ACPI SCI pin attributes conflict with
 * PCI IRQ attributes, mark ACPI SCI as ISA_ALWAYS so it won't be use for

--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -892,10 +892,17 @@ static int opp_get_microvolt(struct dev_pm_opp *opp, struct device *dev)
 	u32 microvolt[3] = {0};
 	int count, ret;

-	count = of_property_count_u32_elems(opp->np, "opp-microvolt");
-	if (!count)
+	/* Missing property isn't a problem, but an invalid entry is */
+	if (!of_find_property(opp->np, "opp-microvolt", NULL))
 		return 0;

+	count = of_property_count_u32_elems(opp->np, "opp-microvolt");
+	if (count < 0) {
+		dev_err(dev, "%s: Invalid opp-microvolt property (%d)\n",
+			__func__, count);
+		return count;
+	}
+
 	/* There can be one or three elements here */
 	if (count != 1 && count != 3) {
 		dev_err(dev, "%s: Invalid number of elements in opp-microvolt property (%d)\n",
@@ -1063,7 +1070,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_add);
 * share a common logic which is isolated here.
 *
 * Return: -EINVAL for bad pointers, -ENOMEM if no memory available for the
- * copy operation, returns 0 if no modifcation was done OR modification was
+ * copy operation, returns 0 if no modification was done OR modification was
 * successful.
 *
 * Locking: The internal device_opp and opp structures are RCU protected.
@@ -1151,7 +1158,7 @@ static int _opp_set_availability(struct device *dev, unsigned long freq,
 * mutex locking or synchronize_rcu() blocking calls cannot be used.
 *
 * Return: -EINVAL for bad pointers, -ENOMEM if no memory available for the
- * copy operation, returns 0 if no modifcation was done OR modification was
+ * copy operation, returns 0 if no modification was done OR modification was
 * successful.
 */
 int dev_pm_opp_enable(struct device *dev, unsigned long freq)
@@ -1177,7 +1184,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_enable);
 * mutex locking or synchronize_rcu() blocking calls cannot be used.
 *
 * Return: -EINVAL for bad pointers, -ENOMEM if no memory available for the
- * copy operation, returns 0 if no modifcation was done OR modification was
+ * copy operation, returns 0 if no modification was done OR modification was
 * successful.
 */
 int dev_pm_opp_disable(struct device *dev, unsigned long freq)

--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -620,7 +620,7 @@ static struct cpuidle_state skl_cstates[] = {
 		.name = "C6-SKL",
 		.desc = "MWAIT 0x20",
 		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
-		.exit_latency = 75,
+		.exit_latency = 85,
 		.target_residency = 200,
 		.enter = &intel_idle,
 		.enter_freeze = intel_idle_freeze, },
@@ -636,10 +636,18 @@ static struct cpuidle_state skl_cstates[] = {
 		.name = "C8-SKL",
 		.desc = "MWAIT 0x40",
 		.flags = MWAIT2flg(0x40) | CPUIDLE_FLAG_TLB_FLUSHED,
-		.exit_latency = 174,
+		.exit_latency = 200,
 		.target_residency = 800,
 		.enter = &intel_idle,
 		.enter_freeze = intel_idle_freeze, },
+	{
+		.name = "C9-SKL",
+		.desc = "MWAIT 0x50",
+		.flags = MWAIT2flg(0x50) | CPUIDLE_FLAG_TLB_FLUSHED,
+		.exit_latency = 480,
+		.target_residency = 5000,
+		.enter = &intel_idle,
+		.enter_freeze = intel_idle_freeze, },
 	{
 		.name = "C10-SKL",
 		.desc = "MWAIT 0x60",

--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -299,9 +299,10 @@ static long local_pci_probe(void *_ddi)
 	 * Unbound PCI devices are always put in D0, regardless of
 	 * runtime PM status.  During probe, the device is set to
 	 * active and the usage count is incremented.  If the driver
-	 * supports runtime PM, it should call pm_runtime_put_noidle()
-	 * in its probe routine and pm_runtime_get_noresume() in its
-	 * remove routine.
+	 * supports runtime PM, it should call pm_runtime_put_noidle(),
+	 * or any other runtime PM helper function decrementing the usage
+	 * count, in its probe routine and pm_runtime_get_noresume() in
+	 * its remove routine.
 	 */
 	pm_runtime_get_sync(dev);
 	pci_dev->driver = pci_drv;

--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -217,6 +217,7 @@ struct pci_dev;

 int acpi_pci_irq_enable (struct pci_dev *dev);
 void acpi_penalize_isa_irq(int irq, int active);
+bool acpi_isa_irq_available(int irq);
 void acpi_penalize_sci_irq(int irq, int trigger, int polarity);
 void acpi_pci_irq_disable (struct pci_dev *dev);


--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -71,8 +71,11 @@ unsigned int extra_msr_offset32;
 unsigned int extra_msr_offset64;
 unsigned int extra_delta_offset32;
 unsigned int extra_delta_offset64;
+unsigned int aperf_mperf_multiplier = 1;
 int do_smi;
 double bclk;
+double base_hz;
+double tsc_tweak = 1.0;
 unsigned int show_pkg;
 unsigned int show_core;
 unsigned int show_cpu;
@@ -502,7 +505,7 @@ int format_counters(struct thread_data *t, struct core_data *c,
 	/* %Busy */
 	if (has_aperf) {
 		if (!skip_c0)
-			outp += sprintf(outp, "%8.2f", 100.0 * t->mperf/t->tsc);
+			outp += sprintf(outp, "%8.2f", 100.0 * t->mperf/t->tsc/tsc_tweak);
 		else
 			outp += sprintf(outp, "********");
 	}
@@ -510,7 +513,7 @@ int format_counters(struct thread_data *t, struct core_data *c,
 	/* Bzy_MHz */
 	if (has_aperf)
 		outp += sprintf(outp, "%8.0f",
-			1.0 * t->tsc / units * t->aperf / t->mperf / interval_float);
+			1.0 * t->tsc * tsc_tweak / units * t->aperf / t->mperf / interval_float);

 	/* TSC_MHz */
 	outp += sprintf(outp, "%8.0f", 1.0 * t->tsc/units/interval_float);
@@ -984,6 +987,8 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 			return -3;
 		if (get_msr(cpu, MSR_IA32_MPERF, &t->mperf))
 			return -4;
+		t->aperf = t->aperf * aperf_mperf_multiplier;
+		t->mperf = t->mperf * aperf_mperf_multiplier;
 	}

 	if (do_smi) {
@@ -1149,6 +1154,19 @@ int slv_pkg_cstate_limits[16] = {PCL__0, PCL__1, PCLRSV, PCLRSV, PCL__4, PCLRSV,
 int amt_pkg_cstate_limits[16] = {PCL__0, PCL__1, PCL__2, PCLRSV, PCLRSV, PCLRSV, PCL__6, PCL__7, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV};
 int phi_pkg_cstate_limits[16] = {PCL__0, PCL__2, PCL_6N, PCL_6R, PCLRSV, PCLRSV, PCLRSV, PCLUNL, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV, PCLRSV};

+
+static void
+calculate_tsc_tweak()
+{
+	unsigned long long msr;
+	unsigned int base_ratio;
+
+	get_msr(base_cpu, MSR_NHM_PLATFORM_INFO, &msr);
+	base_ratio = (msr >> 8) & 0xFF;
+	base_hz = base_ratio * bclk * 1000000;
+	tsc_tweak = base_hz / tsc_hz;
+}
+
 static void
 dump_nhm_platform_info(void)
 {
@@ -1926,8 +1944,6 @@ int has_config_tdp(unsigned int family, unsigned int model)

 	switch (model) {
 	case 0x3A:	/* IVB */
-	case 0x3E:	/* IVB Xeon */
-
 	case 0x3C:	/* HSW */
 	case 0x3F:	/* HSX */
 	case 0x45:	/* HSW */
@@ -2543,6 +2559,13 @@ int is_knl(unsigned int family, unsigned int model)
 	return 0;
 }

+unsigned int get_aperf_mperf_multiplier(unsigned int family, unsigned int model)
+{
+	if (is_knl(family, model))
+		return 1024;
+	return 1;
+}
+
 #define SLM_BCLK_FREQS 5
 double slm_freq_table[SLM_BCLK_FREQS] = { 83.3, 100.0, 133.3, 116.7, 80.0};

@@ -2744,6 +2767,9 @@ void process_cpuid()
 		}
 	}

+	if (has_aperf)
+		aperf_mperf_multiplier = get_aperf_mperf_multiplier(family, model);
+
 	do_nhm_platform_info = do_nhm_cstates = do_smi = probe_nhm_msrs(family, model);
 	do_snb_cstates = has_snb_msrs(family, model);
 	do_pc2 = do_snb_cstates && (pkg_cstate_limit >= PCL__2);
@@ -2762,6 +2788,9 @@ void process_cpuid()
 	if (debug)
 		dump_cstate_pstate_config_info();

+	if (has_skl_msrs(family, model))
+		calculate_tsc_tweak();
+
 	return;
 }

@@ -3090,7 +3119,7 @@ int get_and_dump_counters(void)
 }

 void print_version() {
-	fprintf(stderr, "turbostat version 4.7 17-June, 2015"
+	fprintf(stderr, "turbostat version 4.8 26-Sep, 2015"
 		" - Len Brown <lenb@kernel.org>\n");
 }