1. 22 Nov, 2015 2 commits
    • Jesper Dangaard Brouer's avatar
      slub: optimize bulk slowpath free by detached freelist · d0ecd894
      Jesper Dangaard Brouer authored
      This change focus on improving the speed of object freeing in the
      "slowpath" of kmem_cache_free_bulk.
      
      The calls slab_free (fastpath) and __slab_free (slowpath) have been
      extended with support for bulk free, which amortize the overhead of
      the (locked) cmpxchg_double.
      
      To use the new bulking feature, we build what I call a detached
      freelist.  The detached freelist takes advantage of three properties:
      
       1) the free function call owns the object that is about to be freed,
          thus writing into this memory is synchronization-free.
      
       2) many freelist's can co-exist side-by-side in the same slab-page
          each with a separate head pointer.
      
       3) it is the visibility of the head pointer that needs synchronization.
      
      Given these properties, the brilliant part is that the detached
      freelist can be constructed without any need for synchronization.  The
      freelist is constructed directly in the page objects, without any
      synchronization needed.  The detached freelist is allocated on the
      stack of the function call kmem_cache_free_bulk.  Thus, the freelist
      head pointer is not visible to other CPUs.
      
      All objects in a SLUB freelist must belong to the same slab-page.
      Thus, constructing the detached freelist is about matching objects
      that belong to the same slab-page.  The bulk free array is scanned is
      a progressive manor with a limited look-ahead facility.
      
      Kmem debug support is handled in call of slab_free().
      
      Notice kmem_cache_free_bulk no longer need to disable IRQs. This
      only slowed down single free bulk with approx 3 cycles.
      
      Performance data:
       Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
      
      SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
      
      To get stable and comparable numbers, the kernel have been booted with
      "slab_merge" (this also improve performance for larger bulk sizes).
      
      Performance data, compared against fallback bulking:
      
      bulk -  fallback bulk            - improvement with this patch
         1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
         2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
         3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
         4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
         8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
        16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
        30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
        32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
        34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
        48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
        64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
       128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
       158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
       250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
      
      Performance data, compared current in-kernel bulking:
      
      bulk - curr in-kernel  - improvement with this patch
         1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
         2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
         3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
         4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
         8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
        16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)  5.6%
        30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
        32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
        34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
        48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
        64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
       128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
       158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
       250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
      
      Performance with normal SLUB merging is significantly slower for
      larger bulking.  This is believed to (primarily) be an effect of not
      having to share the per-CPU data-structures, as tuning per-CPU size
      can achieve similar performance.
      
      bulk - slab_nomerge   -  normal SLUB merge
         1 -  49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
         2 -  30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
         3 -  23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
         4 -  20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
         8 -  18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
        16 -  17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
        30 -  18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
        32 -  18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
        34 -  23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
        48 -  21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
        64 -  20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
       128 -  27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
       158 -  30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
       250 -  37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
      
      Joint work with Alexander Duyck.
      
      [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
      
      [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0ecd894
    • Jesper Dangaard Brouer's avatar
      slub: support for bulk free with SLUB freelists · 81084651
      Jesper Dangaard Brouer authored
      Make it possible to free a freelist with several objects by adjusting API
      of slab_free() and __slab_free() to have head, tail and an objects counter
      (cnt).
      
      Tail being NULL indicate single object free of head object.  This allow
      compiler inline constant propagation in slab_free() and
      slab_free_freelist_hook() to avoid adding any overhead in case of single
      object free.
      
      This allows a freelist with several objects (all within the same
      slab-page) to be free'ed using a single locked cmpxchg_double in
      __slab_free() and with an unlocked cmpxchg_double in slab_free().
      
      Object debugging on the free path is also extended to handle these
      freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
      objects don't belong to the same slab-page.
      
      These changes are needed for the next patch to bulk free the detached
      freelists it introduces and constructs.
      
      Micro benchmarking showed no performance reduction due to this change,
      when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81084651
  2. 21 Nov, 2015 20 commits
  3. 20 Nov, 2015 10 commits
    • Linus Torvalds's avatar
      Merge tag 'pm+acpi-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 400f3f25
      Linus Torvalds authored
      Pull more power management and ACPI updates from Rafael Wysocki:
       "These are mostly fixes and cleanups (ACPI core, PM core, cpufreq, ACPI
        EC driver, device properties) including three reverts of recent
        intel_pstate driver commits due to a regression introduced by one of
        them plus support for Atom Airmont cores in intel_pstate (which really
        boils down to adding new frequency tables for Airmont) and additional
        turbostat updates.
      
        Specifics:
      
         - Revert three recent intel_pstate driver commits one of which
           introduced a regression and the remaining two depend on the
           problematic one (Rafael Wysocki).
      
         - Fix breakage related to the recently introduced ACPI _CCA object
           support in the PCI DMA setup code (Suravee Suthikulpanit).
      
         - Fix up the recently introduced ACPI CPPC support to only use the
           hardware-reduced version of the PCCT structure as the only
           architecture to support it (ARM64) will only use hardware-reduced
           ACPI anyway (Ashwin Chaugule).
      
         - Fix a cpufreq mediatek driver build problem (Arnd Bergmann).
      
         - Fix the SMBus transaction handling implementation in the ACPI core
           to avoid re-entrant calls to wait_event_timeout() which makes
           intermittent boot stalls related to the Smart Battery Subsystem
           initialization go away and revert a workaround of another problem
           with the same underlying root cause (Chris Bainbridge).
      
         - Fix the generic wakeup interrupts framework to avoid using invalid
           IRQ numbers (Dmitry Torokhov).
      
         - Remove a redundant check from the ACPI EC driver (Markus Elfring).
      
         - Modify the intel_pstate driver so it can support more Atom flavors
           than just one (Baytrail) and add support for Atom Airmont cores
           (which require new freqnency tables) to it (Philippe Longepe).
      
         - Clean up MSR-related symbols in turbostat (Len Brown)"
      
      * tag 'pm+acpi-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PCI: Fix OF logic in pci_dma_configure()
        Revert "Documentation: kernel_parameters for Intel P state driver"
        cpufreq: mediatek: fix build error
        cpufreq: intel_pstate: Add separate support for Airmont cores
        cpufreq: intel_pstate: Replace BYT with ATOM
        Revert "cpufreq: intel_pstate: Use ACPI perf configuration"
        Revert "cpufreq: intel_pstate: Avoid calculation for max/min"
        ACPI-EC: Drop unnecessary check made before calling acpi_ec_delete_query()
        Revert "ACPI / SBS: Add 5 us delay to fix SBS hangs on MacBook"
        ACPI / SMBus: Fix boot stalls / high CPU caused by reentrant code
        PM / wakeirq: check that wake IRQ is valid before accepting it
        ACPI / CPPC: Use h/w reduced version of the PCCT structure
        x86: remove unused definition of MSR_NHM_PLATFORM_INFO
        tools/power turbostat: use new name for MSR_PLATFORM_INFO
      400f3f25
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 2f255351
      Linus Torvalds authored
      Pull powerpc fixlet from Michael Ellerman:
       "Wire up sys_mlock2()"
      
      * tag 'powerpc-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc: Wire up sys_mlock2()
      2f255351
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-fix-4.4-rc2' of git://git.infradead.org/users/vkoul/slave-dma · 86eaf54d
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
       "This has odd fixes spreadout drivers, not major here
      
         - usbdmac fixes for pm
         - edma build and logic fixes
         - build warn fixes for few drivers"
      
      * tag 'dmaengine-fix-4.4-rc2' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: at_hdmac: use %pad format string for dma_addr_t
        dmaengine: at_xdmac: use %pad format string for dma_addr_t
        dmaengine: imx-sdma: remove __init annotation on sdma_event_remap
        dmaengine: edma: predecence bug in GET_NUM_QDMACH()
        dmaengine: edma: fix build without CONFIG_OF
        dmaengine: of_dma: Correct return code for of_dma_request_slave_channel in case !CONFIG_OF
        dmaengine: sh: usb-dmac: Fix pm_runtime_{enable,disable}() imbalance
        dmaengine: sh: usb-dmac: Fix crash on runtime suspend
      86eaf54d
    • Linus Torvalds's avatar
      Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · c69bde78
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "A varied bunch of fixes, the radeon pull is probably a bit larger than
        I'd like, but it contains 2 weeks of stuff, and the Fiji fixes are a
        bit large, but they are Fiji specific.
      
        Otherwise:
      
         - mgag200: One cursor regression oops fix.
         - vc4: A few small fixes and cleanups.
         - core: Atomic fixes and Atomic helper fixes
         - i915: Revert for the backlight regression along with a bunch of
           fixes"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (58 commits)
        drm/atomic-helper: Check encoder/crtc constraints
        Revert "drm/i915: skip modeset if compatible for everyone."
        drm/mgag200: fix kernel hang in cursor code.
        drm/amdgpu: reserve/unreserve objects out of map/unmap operations
        drm/amdgpu: move bo_reserve out of amdgpu_vm_clear_bo
        drm/amdgpu: add lock for interval tree in vm
        drm/amdgpu: keep the owner for VMIDs
        drm/amdgpu: move VM manager clean into the VM code again
        drm/amdgpu: cleanup VM coding style
        drm/amdgpu: remove unused VM manager field
        drm/amdgpu: cleanup scheduler command submission
        drm/amdgpu: fix typo in firmware name
        drm/i915: Consider SPLL as another shared pll, v2.
        drm/i915: Fix gpu frequency change tracing
        drm/vc4: Make sure that planes aren't scaled.
        drm/vc4: Fix some failure to track __iomem decorations on pointers.
        drm/vc4: checking for NULL instead of IS_ERR
        drm/vc4: fix itnull.cocci warnings
        drm/vc4: fix platform_no_drv_owner.cocci warnings
        drm/vc4: vc4_plane_duplicate_state() can be static
        ...
      c69bde78
    • Linus Torvalds's avatar
      Merge tag 'for-linus-4.4' of git://git.code.sf.net/p/openipmi/linux-ipmi · cd6caf55
      Linus Torvalds authored
      Pull IPMI updates from Corey Minyard:
       "Some fixes for small IPMI problems.
      
        The most significant is that the driver wasn't starting the timer for
        some messages, which would result in problems if that message failed
        for some reason.
      
        The others are small optimizations or making things a little neater"
      
      * tag 'for-linus-4.4' of git://git.code.sf.net/p/openipmi/linux-ipmi:
        ipmi watchdog : add panic_wdt_timeout parameter
        char: ipmi: Move MODULE_DEVICE_TABLE() to follow struct
        ipmi: Stop the timer immediately if idle
        ipmi: Start the timer and thread on internal msgs
      cd6caf55
    • Linus Torvalds's avatar
      Merge tag 'renesas-sh-drivers-for-v4.4' of... · 8bdddfae
      Linus Torvalds authored
      Merge tag 'renesas-sh-drivers-for-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas
      
      Pull SH driver fixlet from Simon Horman:
       "I am sending this change after v4.4-rc1 has been released as it
        depends on SoC changes which are present in that rc:
      
         = Remove now unnecessary reference to CONFIG_ARCH_SHMOBILE_MULTI"
      
      * tag 'renesas-sh-drivers-for-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
        drivers: sh: Get rid of CONFIG_ARCH_SHMOBILE_MULTI
      8bdddfae
    • Rafael J. Wysocki's avatar
      Merge branches 'acpi-smbus', 'acpi-ec' and 'acpi-pci' · a3767e3c
      Rafael J. Wysocki authored
      * acpi-smbus:
        Revert "ACPI / SBS: Add 5 us delay to fix SBS hangs on MacBook"
        ACPI / SMBus: Fix boot stalls / high CPU caused by reentrant code
      
      * acpi-ec:
        ACPI-EC: Drop unnecessary check made before calling acpi_ec_delete_query()
      
      * acpi-pci:
        PCI: Fix OF logic in pci_dma_configure()
      a3767e3c
    • Rafael J. Wysocki's avatar
      Merge branch 'pm-sleep' · 0aba0ab8
      Rafael J. Wysocki authored
      * pm-sleep:
        PM / wakeirq: check that wake IRQ is valid before accepting it
      0aba0ab8
    • Rafael J. Wysocki's avatar
      Merge branches 'pm-cpufreq' and 'acpi-cppc' · 9832bf3a
      Rafael J. Wysocki authored
      * pm-cpufreq:
        Revert "Documentation: kernel_parameters for Intel P state driver"
        cpufreq: mediatek: fix build error
        cpufreq: intel_pstate: Add separate support for Airmont cores
        cpufreq: intel_pstate: Replace BYT with ATOM
        Revert "cpufreq: intel_pstate: Use ACPI perf configuration"
        Revert "cpufreq: intel_pstate: Avoid calculation for max/min"
      
      * acpi-cppc:
        ACPI / CPPC: Use h/w reduced version of the PCCT structure
      9832bf3a
    • Suravee Suthikulpanit's avatar
      PCI: Fix OF logic in pci_dma_configure() · 768acd64
      Suravee Suthikulpanit authored
      This patch fixes a bug introduced by previous commit,
      which incorrectly checkes the of_node of the end-point device.
      Instead, it should check the of_node of the host bridge.
      
      Fixes: 50230713 ("PCI: OF: Move of_pci_dma_configure() to pci_dma_configure()")
      Reported-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      768acd64
  4. 19 Nov, 2015 8 commits