1. 06 Feb, 2018 14 commits
    • Merge branch 'linus' into sched/urgent, to resolve conflicts · 82845079
      Ingo Molnar authored
       Conflicts:
      	arch/arm64/kernel/entry.S
      	arch/x86/Kconfig
      	include/linux/sched/mm.h
      	kernel/fork.c
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Merge tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 68c5735e
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - videobuf2 was moved to a media/common dir, as it is now used by the
         DVB subsystem too
      
       - Digital TV core memory mapped support interface
      
       - new sensor driver: ov7740
      
       - several improvements at ddbridge driver
      
       - new V4L2 driver: IPU3 CIO2 CSI-2 receiver unit, found on some Intel
         SoCs
      
       - new tuner driver: tda18250
      
       - finally got rid of all LIRC staging drivers
      
       - as we don't have old lirc drivers anymore, restructure the lirc device
         code
      
       - add support for UVC metadata
      
       - add a new staging driver for NVIDIA Tegra Video Decoder Engine
      
       - DVB kAPI headers moved to include/media
      
       - synchronize the kAPI and uAPI for the DVB subsystem, removing the gap
         for non-legacy APIs
      
       - reduce the kAPI gap for V4L2
      
       - lots of other driver enhancements, cleanups, etc.
      
      * tag 'media/v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (407 commits)
        media: v4l2-compat-ioctl32.c: make ctrl_is_pointer work for subdevs
        media: v4l2-compat-ioctl32.c: refactor compat ioctl32 logic
        media: v4l2-compat-ioctl32.c: don't copy back the result for certain errors
        media: v4l2-compat-ioctl32.c: drop pr_info for unknown buffer type
        media: v4l2-compat-ioctl32.c: copy clip list in put_v4l2_window32
        media: v4l2-compat-ioctl32.c: fix ctrl_is_pointer
        media: v4l2-compat-ioctl32.c: copy m.userptr in put_v4l2_plane32
        media: v4l2-compat-ioctl32.c: avoid sizeof(type)
        media: v4l2-compat-ioctl32.c: move 'helper' functions to __get/put_v4l2_format32
        media: v4l2-compat-ioctl32.c: fix the indentation
        media: v4l2-compat-ioctl32.c: add missing VIDIOC_PREPARE_BUF
        media: v4l2-ioctl.c: don't copy back the result for -ENOTTY
        media: v4l2-ioctl.c: use check_fmt for enum/g/s/try_fmt
        media: vivid: fix module load error when enabling fb and no_error_inj=1
        media: dvb_demux: improve debug messages
        media: dvb_demux: Better handle discontinuity errors
        media: cxusb, dib0700: ignore XC2028_I2C_FLUSH
        media: ts2020: avoid integer overflows on 32 bit machines
        media: i2c: ov7740: use gpio/consumer.h instead of gpio.h
        media: entity: Add a nop variant of media_entity_cleanup
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 2246edfa
      Linus Torvalds authored
      Pull more rdma updates from Doug Ledford:
       "Items of note:
      
         - two patches fix a regression in the 4.15 kernel. The 4.14 kernel
           worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
           4.15. The fix is here.
      
         - one of the patches (the endian notation patch from Lijun) looks
           like a lot of lines of change, but it's mostly mechanical in
           nature. It amounts to the biggest chunk of change in it (it's about
           2/3rds of the overall pull request).
      
        Summary:
      
         - Clean up some function signatures in rxe for clarity
      
         - Tidy the RDMA netlink header to remove unimplemented constants
      
         - bnxt_re driver fixes, one is a regression this window.
      
         - Minor hns driver fixes
      
         - Various fixes from Dan Carpenter and his tool
      
         - Fix IRQ cleanup race in HFI1
      
         - HFI1 performance optimizations and a fix to report counters in the right units
      
         - Fix for an IPoIB startup sequence race with the external manager
      
         - Oops fix for the new kabi path
      
         - Endian cleanups for hns
      
         - Fix for mlx5 related to the new automatic affinity support"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
        net/mlx5: increase async EQ to avoid EQ overrun
        mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
        RDMA/hns: Fix the endian problem for hns
        IB/uverbs: Use the standard kConfig format for experimental
        IB: Update references to libibverbs
        IB/hfi1: Add 16B rcvhdr trace support
        IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
        IB/core: Avoid a potential OOPs for an unused optional parameter
        IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
        IB/ipoib: Fix for potential no-carrier state
        IB/hfi1: Show fault stats in both TX and RX directions
        IB/hfi1: Remove blind constants from 16B update
        IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
        IB/hfi1: Do not override given pcie_pset value
        IB/hfi1: Optimize process_receive_ib()
        IB/hfi1: Remove unnecessary fecn and becn fields
        IB/hfi1: Look up ibport using a pointer in receive path
        IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
        IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
        IB/hfi1: Remove dependence on qp->s_hdrwords
        ...
    • Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 3ff1b28c
      Linus Torvalds authored
      Pull libnvdimm updates from Ross Zwisler:
      
       - Require struct page by default for filesystem DAX to remove a number
         of surprising failure cases. This includes failures with direct I/O,
         gdb and fork(2).
      
       - Add support for the new Platform Capabilities Structure added to the
         NFIT in ACPI 6.2a. This new table tells us whether the platform
         supports flushing of CPU and memory controller caches on unexpected
         power loss events.
      
       - Revamp vmem_altmap and dev_pagemap handling to clean up code and
         better support future PCI P2P uses.
      
       - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
         become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
         spec, and instead rely on the generic ND_CMD_CALL approach used by
         the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.
      
       - Enhance nfit_test so we can test some of the new things added in
         version 1.6 of the DSM specification. This includes testing firmware
         download and simulating the Last Shutdown State (LSS) status.
      
      * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
        libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
        acpi, nfit: fix register dimm error handling
        libnvdimm, namespace: make min namespace size 4K
        tools/testing/nvdimm: force nfit_test to depend on instrumented modules
        libnvdimm/nfit_test: adding support for unit testing enable LSS status
        libnvdimm/nfit_test: add firmware download emulation
        nfit-test: Add platform cap support from ACPI 6.2a to test
        libnvdimm: expose platform persistence attribute for nd_region
        acpi: nfit: add persistent memory control flag for nd_region
        acpi: nfit: Add support for detect platform CPU cache flush on power loss
        device-dax: Fix trailing semicolon
        libnvdimm, btt: fix uninitialized err_lock
        dax: require 'struct page' by default for filesystem dax
        ext2: auto disable dax instead of failing mount
        ext4: auto disable dax instead of failing mount
        mm, dax: introduce pfn_t_special()
        mm: Fix devm_memremap_pages() collision handling
        mm: Fix memory size alignment in devm_memremap_pages_release()
        memremap: merge find_dev_pagemap into get_dev_pagemap
        memremap: change devm_memremap_pages interface to use struct dev_pagemap
        ...
    • Merge tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 105cf3c8
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
      
       - skip AER driver error recovery callbacks for correctable errors
         reported via ACPI APEI, as we already do for errors reported via the
         native path (Tyler Baicar)
      
       - fix DPC shared interrupt handling (Alex Williamson)
      
       - print full DPC interrupt number (Keith Busch)
      
       - enable DPC only if AER is available (Keith Busch)
      
       - simplify DPC code (Bjorn Helgaas)
      
       - calculate ASPM L1 substate parameter instead of hardcoding it (Bjorn
         Helgaas)
      
       - enable Latency Tolerance Reporting for ASPM L1 substates (Bjorn
         Helgaas)
      
       - move ASPM internal interfaces out of public header (Bjorn Helgaas)
      
       - allow hot-removal of VGA devices (Mika Westerberg)
      
       - speed up unplug and shutdown by assuming Thunderbolt controllers
         don't support Command Completed events (Lukas Wunner)
      
       - add AtomicOps support for GPU and Infiniband drivers (Felix Kuehling,
         Jay Cornwall)
      
       - expose "ari_enabled" in sysfs to help NIC naming (Stuart Hayes)
      
       - clean up PCI DMA interface usage (Christoph Hellwig)
      
       - remove PCI pool API (replaced with DMA pool) (Romain Perier)
      
       - deprecate pci_get_bus_and_slot(), which assumed PCI domain 0 (Sinan
         Kaya)
      
       - move DT PCI code from drivers/of/ to drivers/pci/ (Rob Herring)
      
       - add PCI-specific wrappers for dev_info(), etc (Frederick Lawler)
      
       - remove warnings on sysfs mmap failure (Bjorn Helgaas)
      
       - quiet ROM validation messages (Alex Deucher)
      
       - remove redundant memory alloc failure messages (Markus Elfring)
      
       - fill in types for compile-time VGA and other I/O port resources
         (Bjorn Helgaas)
      
       - make "pci=pcie_scan_all" work for Root Ports as well as Downstream
         Ports to help AmigaOne X1000 (Bjorn Helgaas)
      
       - add SPDX tags to all PCI files (Bjorn Helgaas)
      
       - quirk Marvell 9128 DMA aliases (Alex Williamson)
      
       - quirk broken INTx disable on Ceton InfiniTV4 (Bjorn Helgaas)
      
       - fix CONFIG_PCI=n build by adding dummy pci_irqd_intx_xlate() (Niklas
         Cassel)
      
       - use DMA API to get MSI address for DesignWare IP (Niklas Cassel)
      
       - fix endpoint-mode DMA mask configuration (Kishon Vijay Abraham I)
      
       - fix ARTPEC-6 incorrect IS_ERR() usage (Wei Yongjun)
      
       - add support for ARTPEC-7 SoC (Niklas Cassel)
      
       - add endpoint-mode support for ARTPEC (Niklas Cassel)
      
       - add Cadence PCIe host and endpoint controller driver (Cyrille
         Pitchen)
      
       - handle multiple INTx status bits being set in dra7xx (Vignesh R)
      
       - translate dra7xx hwirq range to fix INTD handling (Vignesh R)
      
       - remove deprecated Exynos PHY initialization code (Jaehoon Chung)
      
       - fix MSI erratum workaround for HiSilicon Hip06/Hip07 (Dongdong Liu)
      
       - fix NULL pointer dereference in iProc BCMA driver (Ray Jui)
      
       - fix Keystone interrupt-controller-node lookup (Johan Hovold)
      
       - constify qcom driver structures (Julia Lawall)
      
       - rework Tegra config space mapping to increase space available for
         endpoints (Vidya Sagar)
      
       - simplify Tegra driver by using bus->sysdata (Manikanta Maddireddy)
      
       - remove PCI_REASSIGN_ALL_BUS usage on Tegra (Manikanta Maddireddy)
      
       - add support for Global Fabric Manager Server (GFMS) event to
         Microsemi Switchtec switch driver (Logan Gunthorpe)
      
       - add IDs for Switchtec PSX 24xG3 and PSX 48xG3 (Kelvin Cao)
      
      * tag 'pci-v4.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
        PCI: cadence: Add EndPoint Controller driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe endpoint controller
        PCI: endpoint: Fix EPF device name to support multi-function devices
        PCI: endpoint: Add the function number as argument to EPC ops
        PCI: cadence: Add host driver for Cadence PCIe controller
        dt-bindings: PCI: cadence: Add DT bindings for Cadence PCIe host controller
        PCI: Add vendor ID for Cadence
        PCI: Add generic function to probe PCI host controllers
        PCI: generic: fix missing call of pci_free_resource_list()
        PCI: OF: Add generic function to parse and allocate PCI resources
        PCI: Regroup all PCI related entries into drivers/pci/Makefile
        PCI/DPC: Reformat DPC register definitions
        PCI/DPC: Add and use DPC Status register field definitions
        PCI/DPC: Squash dpc_rp_pio_get_info() into dpc_process_rp_pio_error()
        PCI/DPC: Remove unnecessary RP PIO register structs
        PCI/DPC: Push dpc->rp_pio_status assignment into dpc_rp_pio_get_info()
        PCI/DPC: Squash dpc_rp_pio_print_error() into dpc_rp_pio_get_info()
        PCI/DPC: Make RP PIO log size check more generic
        PCI/DPC: Rename local "status" to "dpc_status"
        PCI/DPC: Squash dpc_rp_pio_print_tlp_header() into dpc_rp_pio_print_error()
        ...
    • sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Mel Gorman authored
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though two tasks may be active, all cores are evenly used. This is great from
      a search perspective and spreads a load across individual cores, but it has
      adverse consequences for cpufreq. As each CPU has relatively low utilisation,
      cpufreq may decide the utilisation is too low to use a higher P-state and
      overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artificially
      boosting P-state (at C0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU for each task: the CPU the task was
      running on when it was last a waker, or a CPU it was recently using when
      it is a wakee. During SIS, the recently used CPU is used as a target if
      it's still allowed by the task and is idle.
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use but a similar
      principle applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time and it's simply due to the fact that A can be rescheduled to
      another CPU and the pattern is that prev == target when B tries to wakeup A
      and the information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating tasks are more likely to use just two cores instead
      of all available cores sharing a LLC.
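The selection order described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `task_model`, `cpu_is_idle`, `scan_for_idle()` and `select_idle_model()` are hypothetical names, all CPUs are assumed to share one LLC, and the wrapped walk is reduced to a plain loop.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical, simplified stand-ins for scheduler state. */
struct task_model {
	int recent_used_cpu;   /* CPU remembered from an earlier wakeup */
	bool allowed[NR_CPUS]; /* affinity mask of the task */
};

/* Idle state per CPU; every CPU here is assumed to share one LLC. */
static bool cpu_is_idle[NR_CPUS];

/* Fallback when no shortcut applies: wrapped walk starting after target. */
static int scan_for_idle(int target)
{
	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (target + i) % NR_CPUS;

		if (cpu_is_idle[cpu])
			return cpu;
	}
	return target;
}

/* Pick a wakeup CPU: target, then prev, then the recently used CPU. */
static int select_idle_model(struct task_model *p, int prev, int target)
{
	int recent = p->recent_used_cpu;

	if (cpu_is_idle[target])
		return target;
	if (prev != target && cpu_is_idle[prev])
		return prev;

	/* Use the remembered CPU only if it is distinct, idle and allowed. */
	if (recent != prev && recent != target &&
	    cpu_is_idle[recent] && p->allowed[recent]) {
		/* Remember prev as the candidate for the next wakeup. */
		p->recent_used_cpu = prev;
		return recent;
	}

	return scan_for_idle(target);
}
```

In this model, a task whose remembered CPU is idle returns to it instead of being placed on whichever idle CPU the walk happens to find first, which is the A/B ping-pong pattern shown in the second diagram.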
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. It's also known that network benchmarks running on localhost
      also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also sees small improvements
      (6-11% depending on machine and thread count). The facebook schbench was also
      tested but in most cases showed little or no difference in wakeup latencies.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not migrate if the prev_cpu is idle · 806486c3
      Mel Gorman authored
      wake_affine_idle() prefers to move a task to the current CPU if the
      wakeup is due to an interrupt. The expectation is that the interrupt
      data is cache hot and relevant to the waking task as well as avoiding
      a search. However, there is no way to determine if there was cache hot
      data on the previous CPU that may exceed the interrupt data. Furthermore,
      round-robin delivery of interrupts can migrate tasks around a socket where
      each CPU is under-utilised.  This can interact badly with cpufreq which
      makes decisions based on per-cpu data. It has been observed on machines
      with HWP that p-states are not boosted to their maximum levels even though
      the workload is latency and throughput sensitive.
      
      This patch uses the previous CPU for the task if it's idle and cache-affine
      with the current CPU, even if the current CPU is idle due to the wakeup
      being related to the interrupt. This reduces migrations at the cost of
      the interrupt data not being cache hot when the task wakes.
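A minimal userspace sketch of that decision follows. The arrays `cpu_idle`, `llc_id` and `nr_running`, the sentinel `NO_CPU` and the function `wake_affine_idle_model()` are hypothetical stand-ins for scheduler state, not the kernel's own definitions. The helper returns a CPU id, or the sentinel when it has no opinion.

```c
#include <stdbool.h>

#define NR_CPUS 4
#define NO_CPU  NR_CPUS /* sentinel: the helper made no decision */

/* Hypothetical per-CPU state for the model. */
static bool cpu_idle[NR_CPUS];
static int  llc_id[NR_CPUS];     /* CPUs with equal ids share a cache */
static int  nr_running[NR_CPUS];

static int wake_affine_idle_model(int this_cpu, int prev_cpu, bool sync)
{
	/*
	 * The waking CPU is idle and cache-affine with prev_cpu: stay on
	 * prev_cpu when it is idle too, otherwise pull the task over.
	 */
	if (cpu_idle[this_cpu] && llc_id[this_cpu] == llc_id[prev_cpu])
		return cpu_idle[prev_cpu] ? prev_cpu : this_cpu;

	/* Sync wakeup: the waker is about to sleep, so its CPU frees up. */
	if (sync && nr_running[this_cpu] == 1)
		return this_cpu;

	return NO_CPU;
}
```

Preferring `prev_cpu` in the first branch is what trades interrupt-data cache hotness for fewer migrations.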
      
      A variety of workloads were tested on various machines and no adverse
      impact was noticed that was outside noise. dbench on ext4 on UMA showed
      roughly 10% reduction in the number of CPU migrations and it is a case
      where interrupts are frequent for IO completions. In most cases, the
      difference in performance is quite small but variability is often
      reduced. For example, this is the result for pgbench running on a UMA
      machine with different numbers of clients.
      
                                4.15.0-rc9             4.15.0-rc9
                                  baseline              waprev-v1
       Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
       Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
       Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
       Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
       Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
       Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
       Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
       Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
       Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
       Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
       CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)
      
      Note that the difference in performance is marginal but for low utilisation,
      there is less variability.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Restructure wake_affine*() to return a CPU id · 3b76c4a3
      Mel Gorman authored
      This is a preparation patch that has wake_affine*() return a CPU ID instead of
      a boolean. The intent is to allow the wake_affine() helpers to be avoided
      if a decision is already made. This patch has no functional change.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove unnecessary parameters from wake_affine_idle() · 89a55f56
      Mel Gorman authored
      wake_affine_idle() takes parameters it never uses so clean it up.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Make update_curr_rt() more accurate · e7ad2031
      Wen Yang authored
      rq->clock_task may be updated between the two calls of
      rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
      once makes it more accurate and efficient, taking update_curr() as
      reference.
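The idea of reading the clock once can be sketched in userspace as follows. This is an illustrative model only: `rq_model`, `rt_task_model`, `rq_clock_task_model()` and `update_curr_rt_model()` are hypothetical names, and the real function does considerably more accounting.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the runqueue and the task. */
struct rq_model {
	uint64_t clock_task;
};

struct rt_task_model {
	uint64_t exec_start;
	uint64_t sum_exec_runtime;
};

static uint64_t rq_clock_task_model(const struct rq_model *rq)
{
	return rq->clock_task;
}

static void update_curr_rt_model(struct rq_model *rq, struct rt_task_model *curr)
{
	/* Sample the clock once and reuse the same value below. */
	uint64_t now = rq_clock_task_model(rq);
	int64_t delta_exec = (int64_t)(now - curr->exec_start);

	if (delta_exec <= 0)
		return;

	curr->sum_exec_runtime += (uint64_t)delta_exec;
	/* Same timestamp as the delta, so no time is lost or counted twice. */
	curr->exec_start = now;
}
```

With two separate reads, a clock update between them would make `exec_start` later than the instant used for the delta, and the time in between would never be accounted.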
      Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Up the root domain ref count when passing it around via IPIs · 364f5665
      Steven Rostedt (VMware) authored
      When issuing an IPI RT push, where an IPI is sent to each CPU that has more
      than one RT task scheduled on it, it references the root domain's rto_mask,
      which contains all the CPUs within the root domain that have more than one RT
      task in the runnable state. The problem is, after the IPIs are initiated, the
      rq->lock is released. This means that the root domain that is associated to
      the run queue could be freed while the IPIs are going around.
      
      Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
      the root domain's ref count respectively. This way when initiating the IPIs,
      the scheduler will up the root domain's ref count before releasing the
      rq->lock, ensuring that the root domain does not go away until the IPI round
      is complete.
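The get/put pairing can be modelled in userspace with C11 atomics. This is a sketch under assumptions, not the kernel implementation: `rd_model`, `rd_get()` and `rd_put()` are hypothetical names, and a `freed` flag stands in for actually releasing the object.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted root domain. */
struct rd_model {
	atomic_int refcount;
	bool freed; /* set instead of really freeing the object */
};

/* Take a reference before handing the object to an asynchronous user. */
static void rd_get(struct rd_model *rd)
{
	atomic_fetch_add(&rd->refcount, 1);
}

/* Drop a reference; whoever drops the last one releases the object. */
static void rd_put(struct rd_model *rd)
{
	/* atomic_fetch_sub returns the previous value. */
	if (atomic_fetch_sub(&rd->refcount, 1) == 1)
		rd->freed = true;
}
```

In this model the initiator calls `rd_get()` while it still holds the lock, and the last step of the IPI round calls `rd_put()`, so a concurrent detach of the runqueue cannot release the object mid-round.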
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() · ad0f1d9d
      Steven Rostedt (VMware) authored
      When the rto_push_irq_work_func() is called, it looks at the RT overloaded
      bitmask in the root domain via the runqueue (rq->rd). The problem is that
      during CPU up and down, nothing here stops rq->rd from changing between
      taking the rq->rd->rto_lock and releasing it. That means the lock that is
      released is not the same lock that was taken.
      
      Instead of using this_rq()->rd to get the root domain, as the irq work is
      part of the root domain, we can simply get the root domain from the irq work
      that is passed to the routine:
      
       container_of(work, struct root_domain, rto_push_work)
      
      This keeps the root domain consistent.
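For readers unfamiliar with the idiom, here is a userspace sketch of recovering the enclosing structure from a pointer to one of its members. The macro `container_of_model` and the `*_model` types are hypothetical simplifications. The kernel's own container_of() adds type checking.

```c
#include <stddef.h>

/* Simplified userspace variant of the container_of() idiom. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the kernel structures. */
struct irq_work_model {
	int pending;
};

struct root_domain_model {
	int rto_lock;
	struct irq_work_model rto_push_work; /* embedded, not a pointer */
};

/* The callback only receives the work item; derive its owner from it. */
static struct root_domain_model *rd_from_work(struct irq_work_model *work)
{
	return container_of_model(work, struct root_domain_model, rto_push_work);
}
```

Because the work item is embedded in the root domain, the owner derived this way cannot change underneath the callback, unlike a pointer re-read from the runqueue.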
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Optimize update_stats_*() · 2ed41a55
      Peter Zijlstra authored
      These functions are already gated by schedstat_enabled(), there is no
      point in then issuing another static_branch for every individual
      update in them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Optimize ttwu_stat() · b85c8b71
      Peter Zijlstra authored
      The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
      there is absolutely no point in then issuing another static_branch for
      every single schedstat_inc() in there.
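The pattern can be sketched in userspace as below. Everything here is a hypothetical model: a plain `bool` stands in for the static branch, `branch_checks` counts how often it is evaluated, and the macro and function names are invented for illustration.

```c
#include <stdbool.h>

/* Plain flag standing in for the static branch of the real code. */
static bool stats_enabled;
static int branch_checks; /* how many times the flag was tested */

static bool stats_enabled_check(void)
{
	branch_checks++;
	return stats_enabled;
}

/* Checked variant: tests the flag on every single update. */
#define stat_inc_checked(var) \
	do { if (stats_enabled_check()) (var)++; } while (0)

/* Raw variant: for callers that have already tested the flag. */
#define stat_inc_raw(var) \
	do { (var)++; } while (0)

struct wake_stats {
	unsigned int ttwu_count;
	unsigned int ttwu_local;
};

static void ttwu_stat_model(struct wake_stats *s, bool local)
{
	/* One check guards the whole function ... */
	if (!stats_enabled_check())
		return;

	/* ... so the individual updates can skip it. */
	stat_inc_raw(s->ttwu_count);
	if (local)
		stat_inc_raw(s->ttwu_local);
}
```

Using `stat_inc_checked()` inside the guarded body would test the flag once more per counter without changing the result, which is the redundancy the commit removes.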
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 05 Feb, 2018 26 commits
    • Merge tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · e237f98a
      Linus Torvalds authored
      Pull more xfs updates from Darrick Wong:
       "As promised, here's a (much smaller) second pull request for the
        second week of the merge cycle. This time around we have a couple
        patches shutting off unsupported fs configurations, and a couple of
        cleanups.
      
        Last, we turn off EXPERIMENTAL for the reverse mapping btree, since
        the primary downstream user of that information (online fsck) is now
        upstream and I haven't seen any major failures in a few kernel
        releases.
      
        Summary:
      
         - Print scrub build status in the xfs build info.
      
         - Explicitly call out the remaining two scenarios where we don't
           support reflink and never have.
      
         - Remove EXPERIMENTAL tag from reverse mapping btree!"
      
      * tag 'xfs-4.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: remove experimental tag for reverse mapping
        xfs: don't allow reflink + realtime filesystems
        xfs: don't allow DAX on reflink filesystems
        xfs: add scrub to XFS_BUILD_OPTIONS
        xfs: fix u32 type usage in sb validation function
    • Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · 139351f1
      Linus Torvalds authored
      Pull overlayfs updates from Miklos Szeredi:
       "This work from Amir adds NFS export capability to overlayfs. NFS
        exporting an overlay filesystem is a challenge because we want to keep
        track of any copy-up of a file or directory between encoding the file
        handle and decoding it.
      
        This is achieved by indexing copied up objects by lower layer file
        handle. The index is already used for hard links, this patchset
        extends the use to NFS file handle decoding"
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (51 commits)
        ovl: check ERR_PTR() return value from ovl_encode_fh()
        ovl: fix regression in fsnotify of overlay merge dir
        ovl: wire up NFS export operations
        ovl: lookup indexed ancestor of lower dir
        ovl: lookup connected ancestor of dir in inode cache
        ovl: hash non-indexed dir by upper inode for NFS export
        ovl: decode pure lower dir file handles
        ovl: decode indexed dir file handles
        ovl: decode lower file handles of unlinked but open files
        ovl: decode indexed non-dir file handles
        ovl: decode lower non-dir file handles
        ovl: encode lower file handles
        ovl: copy up before encoding non-connectable dir file handle
        ovl: encode non-indexed upper file handles
        ovl: decode connected upper dir file handles
        ovl: decode pure upper file handles
        ovl: encode pure upper file handles
        ovl: document NFS export
        vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()
        ovl: store 'has_upper' and 'opaque' as bit flags
        ...
      139351f1
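The overlayfs pull message above says that copied-up objects are indexed by lower layer file handle, so a handle encoded before a copy-up still decodes afterwards. A toy model of that lookup follows, assuming a fixed-size in-memory table. `struct index_entry`, `index_add()` and `decode_fh()` are hypothetical names for illustration, not overlayfs functions, and the 8-byte handle is not the real on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FH_LEN   8   /* toy handle size, not the real overlayfs format */
#define MAX_IDX 16

/* One index entry: lower-layer handle -> name of the copied-up object. */
struct index_entry {
	unsigned char lower_fh[FH_LEN];
	const char *upper;
};

static struct index_entry index_tbl[MAX_IDX];
static size_t index_len;

/* Record a copy-up: called when a lower object gets an upper copy. */
static int index_add(const unsigned char *lower_fh, const char *upper)
{
	assert(lower_fh && upper);
	if (index_len == MAX_IDX)
		return -1;
	memcpy(index_tbl[index_len].lower_fh, lower_fh, FH_LEN);
	index_tbl[index_len].upper = upper;
	index_len++;
	return 0;
}

/*
 * Decode a handle that was encoded from the lower layer.  If the object
 * was copied up in the meantime, the index redirects to the upper copy;
 * otherwise the lower object is still the right answer.
 */
static const char *decode_fh(const unsigned char *lower_fh, const char *lower)
{
	for (size_t i = 0; i < index_len; i++)
		if (memcmp(index_tbl[i].lower_fh, lower_fh, FH_LEN) == 0)
			return index_tbl[i].upper;
	return lower;
}
```

Calling `decode_fh()` before and after `index_add()` with the same handle returns the lower name first and the upper name afterwards, which is the property an NFS client relies on.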
    • Mathieu Desnoyers's avatar
      membarrier/selftest: Test private expedited sync core command · 460e8c33
      Mathieu Desnoyers authored
      Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
      MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alice Ferrazzi <alice.ferrazzi@gmail.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Elder <paul.elder@pitt.edu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-12-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      460e8c33
    • Mathieu Desnoyers's avatar
      membarrier/arm64: Provide core serializing command · f1e3a12b
      Mathieu Desnoyers authored
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-11-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1e3a12b
    • Mathieu Desnoyers's avatar
      membarrier/x86: Provide core serializing command · 10bcc80e
      Mathieu Desnoyers authored
      There are two places where core serialization is needed by membarrier:
      
      1) When returning from the membarrier IPI,
      2) After scheduler updates curr to a thread with a different mm, before
         going back to user-space, since the curr->mm is used by membarrier to
         check whether it needs to send an IPI to that CPU.
      
      x86-32 uses IRET as return from interrupt, and both IRET and SYSEXIT to go
      back to user-space. The IRET instruction is core serializing, but not
      SYSEXIT.
      
      x86-64 uses IRET as return from interrupt, which takes care of the IPI.
      However, it can return to user-space through either SYSRETL (compat
      code), SYSRETQ, or IRET. Given that SYSRET{L,Q} is not core serializing,
      we rely instead on write_cr3() performed by switch_mm() to provide core
      serialization after changing the current mm, and deal with the special
      case of kthread -> uthread (temporarily keeping current mm into
      active_mm) by adding a sync_core() in that specific case.
      
      Use the new sync_core_before_usermode() to guarantee this.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-10-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10bcc80e
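The reasoning in the x86 commit above can be restated as a small decision table: IRET is core serializing, SYSEXIT and SYSRET{L,Q} are not, and write_cr3() in switch_mm() covers the case where the mm actually changed. The sketch below is only a model of that argument. The enum and function names are hypothetical, and the kernel does not make this choice per exit path at run time.

```c
#include <assert.h>
#include <stdbool.h>

/* Ways x86 can leave the kernel, as listed in the commit message. */
enum exit_path { EXIT_IRET, EXIT_SYSEXIT, EXIT_SYSRETL, EXIT_SYSRETQ };

/* Only IRET is core serializing; SYSEXIT and SYSRET{L,Q} are not. */
static bool exit_is_core_serializing(enum exit_path p)
{
	assert(p >= EXIT_IRET && p <= EXIT_SYSRETQ);
	return p == EXIT_IRET;
}

/*
 * Model of when an explicit sync_core() is needed before user-space.
 * mm_switched: switch_mm() ran write_cr3(), which already serializes.
 * The kthread -> uthread case keeps the mm in active_mm, so no
 * write_cr3() happens and mm_switched is false.
 */
static bool needs_explicit_sync_core(enum exit_path p, bool mm_switched)
{
	if (exit_is_core_serializing(p))
		return false;
	return !mm_switched;
}
```

The only combination that returns true is a non-serializing exit without a preceding write_cr3(), which matches the special case the commit handles with sync_core_before_usermode().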
    • Mathieu Desnoyers's avatar
      membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Mathieu Desnoyers authored
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      70216e18
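From user space, a JIT would plausibly use the SYNC_CORE commands described above in two steps: register once, then issue the barrier before reusing memory that held generated code. The sketch below assumes a Linux system and calls the system call through syscall(2). The `JIT_`-prefixed constants and the `jit_*` helpers are hypothetical names; the values mirror the UAPI header so the sketch also builds against older headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Local copies of the UAPI command values (assumed names, JIT_ prefix). */
#define JIT_MEMBARRIER_CMD_QUERY                                0
#define JIT_MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE          (1 << 5)
#define JIT_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE (1 << 6)

/* Thin wrapper around the raw system call; flags are always 0 here. */
static int membarrier_call(int cmd)
{
	assert(cmd >= 0);
#ifdef __NR_membarrier
	return (int)syscall(__NR_membarrier, cmd, 0);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/* Once at start-up: returns 0 if SYNC_CORE is supported and registered. */
static int jit_membarrier_init(void)
{
	int mask = membarrier_call(JIT_MEMBARRIER_CMD_QUERY);

	if (mask < 0)
		return -1;
	if (!(mask & JIT_MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE)) {
		errno = EINVAL;   /* architecture did not opt in */
		return -1;
	}
	return membarrier_call(JIT_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE);
}

/* Before reusing memory that held JIT code: serialize sibling threads. */
static int jit_sync_core_all_threads(void)
{
	return membarrier_call(JIT_MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE);
}
```

Querying first keeps the sketch usable on kernels or architectures that do not select ARCH_HAS_MEMBARRIER_SYNC_CORE; in that case `jit_membarrier_init()` fails and the caller needs another strategy.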
    • Mathieu Desnoyers's avatar
      locking/x86: Implement sync_core_before_usermode() · ac1ab12a
      Mathieu Desnoyers authored
      Ensure that a core serializing instruction is issued before returning to
      user-mode. x86 implements return to user-space through sysexit, sysretl,
      and sysretq, which are not core serializing.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-8-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac1ab12a
    • Mathieu Desnoyers's avatar
      locking: Introduce sync_core_before_usermode() · e61938a9
      Mathieu Desnoyers authored
      Introduce an architecture function that ensures the current CPU
      issues a core serializing instruction before returning to usermode.
      
      This is needed for the membarrier "sync_core" command.
      
      Architectures defining the sync_core_before_usermode() static inline
      need to select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-7-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e61938a9
    • Mathieu Desnoyers's avatar
      membarrier/selftest: Test global expedited command · 92485487
      Mathieu Desnoyers authored
      Test the new MEMBARRIER_CMD_GLOBAL_EXPEDITED and
      MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED commands.
      
      Adapt to the MEMBARRIER_CMD_SHARED -> MEMBARRIER_CMD_GLOBAL rename.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alice Ferrazzi <alice.ferrazzi@gmail.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Elder <paul.elder@pitt.edu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-6-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      92485487
    • Mathieu Desnoyers's avatar
      membarrier: Provide GLOBAL_EXPEDITED command · c5f58bd5
      Mathieu Desnoyers authored
      Allow expedited membarrier to be used for data shared between processes
      through shared memory.
      
      Processes wishing to receive the membarriers register with
      MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
      membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This allows extremely simple kernel-level implementation: we have almost
      everything we need with the PRIVATE_EXPEDITED barrier code. All we need
      to do is to add a flag in the mm_struct that will be used to check
      whether we need to send the IPI to the current thread of each CPU.
      
      There is a slight downside to this approach compared to targeting
      specific shared memory users: when performing a membarrier operation,
      all registered "global" receivers will get the barrier, even if they
      don't share a memory mapping with the sender issuing
      MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This registration approach seems to fit the requirement of not
      disturbing processes that really deeply care about real-time: they
      simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
      
      In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
      command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
      MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
      compatibility.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c5f58bd5
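The GLOBAL_EXPEDITED commit above describes the mechanism as a flag in the mm_struct that is checked for the current thread of each CPU. A user-space model of that selection step is sketched below. `struct toy_mm` and `global_expedited_targets()` are hypothetical, and details such as skipping the calling CPU are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for mm_struct: only the registration flag matters here. */
struct toy_mm {
	bool global_expedited;   /* set by REGISTER_GLOBAL_EXPEDITED */
};

/*
 * For each CPU, look at the mm of the thread currently running there and
 * mark the CPU for an IPI if that mm registered as a global receiver.
 * curr_mm[i] == NULL models a kernel thread.  Returns the IPI count.
 */
static size_t global_expedited_targets(struct toy_mm *const curr_mm[],
				       size_t ncpus, bool ipi[])
{
	size_t n = 0;

	assert(curr_mm && ipi);
	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		ipi[cpu] = curr_mm[cpu] && curr_mm[cpu]->global_expedited;
		n += ipi[cpu];
	}
	return n;
}
```

Nothing in the loop asks whether a receiver shares memory with the sender, which is the downside the commit message mentions: every registered process running on some CPU gets the barrier.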
    • Mathieu Desnoyers's avatar
      membarrier: Document scheduler barrier requirements · 306e0604
      Mathieu Desnoyers authored
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      306e0604
    • Mathieu Desnoyers's avatar
      powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Mathieu Desnoyers authored
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads that target the same VM but belong to different thread
      groups are a tricky case, with a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ccfebed
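The registration order in the powerpc commit above (set the flag, wait, then mark ready) can be modelled in user space as follows. This is a sketch under stated assumptions: `struct pe_mm`, the `STATE_*` values and the `toy_*` helpers are hypothetical, and synchronize_sched() is represented by a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum {
	STATE_PRIVATE_EXPEDITED_READY = 1 << 0,
	STATE_PRIVATE_EXPEDITED       = 1 << 1,
};

struct pe_mm {
	int mm_users;          /* users of this VM, across thread groups */
	int membarrier_state;
	int grace_periods;     /* counts modelled synchronize_sched() calls */
};

/* A single user with a single thread has nobody to wait for. */
static bool can_skip_sync(const struct pe_mm *mm, int nr_threads)
{
	return mm->mm_users == 1 && nr_threads == 1;
}

static void toy_register_private_expedited(struct pe_mm *mm, int nr_threads)
{
	assert(mm);
	if (mm->membarrier_state & STATE_PRIVATE_EXPEDITED_READY)
		return;
	mm->membarrier_state |= STATE_PRIVATE_EXPEDITED;       /* 1: flag  */
	if (!can_skip_sync(mm, nr_threads))
		mm->grace_periods++;                            /* 2: wait  */
	mm->membarrier_state |= STATE_PRIVATE_EXPEDITED_READY; /* 3: ready */
}

/* The command only succeeds once registration has fully completed. */
static int toy_private_expedited(const struct pe_mm *mm)
{
	if (!(mm->membarrier_state & STATE_PRIVATE_EXPEDITED_READY))
		return -EPERM;
	return 0;
}

/* What the context-switch hook tests: the mm flag, not per-thread state. */
static bool toy_switch_mm_needs_barrier(const struct pe_mm *mm)
{
	return mm->membarrier_state & STATE_PRIVATE_EXPEDITED;
}
```

Keeping the state in the mm rather than in thread flags is what lets the hook work for threads in other thread groups that share the same VM.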
    • Mathieu Desnoyers's avatar
      membarrier/selftest: Test private expedited command · 667ca1ec
      Mathieu Desnoyers authored
      Test the new MEMBARRIER_CMD_PRIVATE_EXPEDITED and
      MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED commands.
      
      Add checks expecting specific error values on system calls expected to
      fail.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Shuah Khan <shuahkh@osg.samsung.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alice Ferrazzi <alice.ferrazzi@gmail.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Elder <paul.elder@pitt.edu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      667ca1ec
    • Linus Torvalds's avatar
      Merge tag 'rproc-v4.16' of git://github.com/andersson/remoteproc · 2deb41b2
      Linus Torvalds authored
      Pull remoteproc updates from Bjorn Andersson:
       "This contains a few bug fixes and a cleanup of the resource-table
        handling in the framework, which removes the need for drivers with no
        resource table to provide a fake one"
      
      * tag 'rproc-v4.16' of git://github.com/andersson/remoteproc:
        remoteproc: Reset table_ptr on stop
        remoteproc: Drop dangling find_rsc_table dummies
        remoteproc: Move resource table load logic to find
        remoteproc: Don't handle empty resource table
        remoteproc: Merge rproc_ops and rproc_fw_ops
        remoteproc: Clone rproc_ops in rproc_alloc()
        remoteproc: Cache resource table size
        remoteproc: Remove deprecated crash completion
        virtio_remoteproc: correct put_device virtio_device.dev
      2deb41b2
    • Linus Torvalds's avatar
      Merge tag 'rpmsg-v4.16' of git://github.com/andersson/remoteproc · 67fb3b92
      Linus Torvalds authored
      Pull rpmsg updates from Bjorn Andersson:
       "This fixes a few issues found in the SMD and GLINK drivers and
        corrects the handling of SMD channels that are found in an
        (previously) unexpected state"
      
      * tag 'rpmsg-v4.16' of git://github.com/andersson/remoteproc:
        rpmsg: smd: Fix double unlock in __qcom_smd_send()
        rpmsg: glink: Fix missing mutex_init() in qcom_glink_alloc_channel()
        rpmsg: smd: Don't hold the tx lock during wait
        rpmsg: smd: Fail send on a closed channel
        rpmsg: smd: Wake up all waiters
        rpmsg: smd: Create device for all channels
        rpmsg: smd: Perform handshake during open
        rpmsg: glink: smem: Ensure ordering during tx
        drivers: rpmsg: remove duplicate includes
        remoteproc: qcom: Use PTR_ERR_OR_ZERO() in glink prob
      67fb3b92
    • Linus Torvalds's avatar
      Merge tag 'mmc-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · ae77c958
      Linus Torvalds authored
      Pull MMC host fixes from Ulf Hansson:
      
       - renesas_sdhi: Fix build error in case NO_DMA=y
      
       - sdhci: Implement a bounce buffer to address throughput regressions
      
      * tag 'mmc-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: MMC_SDHI_{SYS,INTERNAL}_DMAC should depend on HAS_DMA
        mmc: sdhci: Implement an SDHCI-specific bounce buffer
      ae77c958
    • Linus Torvalds's avatar
      Merge tag 'pwm/for-4.16-rc1' of... · 20f9aa22
      Linus Torvalds authored
      Merge tag 'pwm/for-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
      
      Pull pwm updates from Thierry Reding:
       "The Meson PWM controller driver gains support for the AXG series and a
        minor bug is fixed for the STMPE driver.
      
        To round things off, the class is now set for PWM channels exported
        via sysfs which allows non-root access, provided that the system has
        been configured accordingly"
      
      * tag 'pwm/for-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
        pwm: meson: Add clock source configuration for Meson-AXG
        dt-bindings: pwm: Update bindings for the Meson-AXG
        pwm: stmpe: Fix wrong register offset for hwpwm=2 case
        pwm: Set class for exported channels in sysfs
      20f9aa22
    • Thierry Reding's avatar
      net: mediatek: Explicitly include pinctrl headers · 140995c9
      Thierry Reding authored
      The Mediatek ethernet driver fails to build after commit 23c35f48
      ("pinctrl: remove include file from <linux/device.h>") because it relies
      on the pinctrl/consumer.h and pinctrl/devinfo.h being pulled in by the
      device.h header implicitly.
      
      Include these headers explicitly to avoid the build failure.
      
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      140995c9
    • Thierry Reding's avatar
      mmc: meson-gx-mmc: Explicitly include pinctrl/consumer.h · 8fb572ac
      Thierry Reding authored
      The Meson GX MMC driver fails to build after commit 23c35f48
      ("pinctrl: remove include file from <linux/device.h>") because it relies
      on the pinctrl/consumer.h being pulled in by the device.h header
      implicitly.
      
      Include the header explicitly to avoid the build failure.
      
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fb572ac
    • Thierry Reding's avatar
      drm/rockchip: lvds: Explicitly include pinctrl headers · 1c16a9ce
      Thierry Reding authored
      The Rockchip LVDS driver fails to build after commit 23c35f48
      ("pinctrl: remove include file from <linux/device.h>") because it relies
      on the pinctrl/consumer.h and pinctrl/devinfo.h being pulled in by the
      device.h header implicitly.
      
      Include these headers explicitly to avoid the build failure.
      
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c16a9ce
    • Stephen Rothwell's avatar
      pinctrl: files should directly include apis they use · 567af7fc
      Stephen Rothwell authored
      Fixes: 23c35f48 ("pinctrl: remove include file from <linux/device.h>")
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      567af7fc
    • Max Gurtovoy's avatar
      net/mlx5: increase async EQ to avoid EQ overrun · 03ecdd2d
      Max Gurtovoy authored
      Currently the async EQ has only 256 entries, which might not be enough
      for the SW to handle all pending events. For example, if many QPs
      (say 1024) are connected to an SRQ created using an NVMeOF target and
      the target goes down, the FW raises 1024 "last WQE reached" events,
      which may overrun the EQ. Increase the EQ to a more reasonable size;
      beyond that size the FW should be able to delay the event and raise it
      later using its internal backpressure mechanism.
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      03ecdd2d
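The arithmetic behind the overrun scenario in the commit above is simple enough to state as code. The helper below is a hypothetical worst-case model in which the consumer drains nothing while the events arrive; it is not taken from the mlx5 driver.

```c
#include <assert.h>

/*
 * Post `events` entries into an event queue with `eq_size` slots while
 * the consumer is not draining it.  Returns how many events do not fit.
 */
static unsigned int eq_overrun(unsigned int eq_size, unsigned int events)
{
	assert(eq_size > 0);
	return events > eq_size ? events - eq_size : 0;
}
```

With the 256-entry queue and the 1024 events from the commit message, `eq_overrun(256, 1024)` gives 768 events with no slot, while any queue of at least 1024 entries absorbs the burst.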
    • Sagi Grimberg's avatar
      mlx5: fix mlx5_get_vector_affinity to start from completion vector 0 · 2572cf57
      Sagi Grimberg authored
      The consumers of this routine expect the affinity map of a vector
      index relative to the first completion vector. The upper layers are
      not aware of internal/private completion vectors that mlx5 allocates
      for its own usage.
      
      Hence, return the affinity map of vector index relative to the first
      completion vector.
      
      Fixes: 05e0cc84 ("net/mlx5: Fix get vector affinity helper function")
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Tested-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Cc: <stable@vger.kernel.org> # v4.15
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      2572cf57
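The fix above amounts to translating an index that upper layers count from the first completion vector into the device-wide vector index. A minimal sketch of that translation follows; the `TOY_*` constants and the function name are hypothetical, and the number of private vectors is an assumed value chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical layout: the first few vectors are private to the driver. */
#define TOY_PRIVATE_VECTORS 4   /* assumed count, for illustration only */
#define TOY_TOTAL_VECTORS   12

/*
 * Translate a completion-vector index, as seen by upper layers, into the
 * device-wide vector index.  Returns -1 if the index is out of range.
 */
static int toy_comp_vector_to_irq_index(int comp_vector)
{
	assert(TOY_PRIVATE_VECTORS < TOY_TOTAL_VECTORS);
	if (comp_vector < 0 ||
	    comp_vector >= TOY_TOTAL_VECTORS - TOY_PRIVATE_VECTORS)
		return -1;
	return TOY_PRIVATE_VECTORS + comp_vector;
}
```

Without the offset, a caller asking for completion vector 0 would be handed the affinity of one of the driver's private vectors.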
    • oulijun's avatar
      RDMA/hns: Fix the endian problem for hns · 8b9b8d14
      oulijun authored
      The hip06 and hip08 run on little-endian ARM. Revise the annotations
      to indicate that the HW uses little-endian data in the various DMA
      buffers, and flow the necessary swaps throughout.

      The imm_data uses big-endian mode. The cpu_to_le32/le32_to_cpu
      swaps are no-ops on these systems, so the only substantive change
      is the handling of imm_data, which is now always swapped.

      This also keeps the driver in line with the userspace hns driver and
      resolves the warnings reported by sparse.
      Signed-off-by: Lijun Ou <oulijun@huawei.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      8b9b8d14
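The distinction the hns commit draws between little-endian DMA fields and the big-endian imm_data can be illustrated in portable user-space C. The kernel itself uses cpu_to_le32() and cpu_to_be32() with sparse annotations; `put_le32()` and `put_be32()` below are hypothetical stand-ins that write explicit byte orders regardless of the host.

```c
#include <assert.h>
#include <stdint.h>

/* Write v to buf in little-endian byte order, whatever the host order. */
static void put_le32(uint8_t buf[4], uint32_t v)
{
	assert(buf);
	buf[0] = (uint8_t)v;
	buf[1] = (uint8_t)(v >> 8);
	buf[2] = (uint8_t)(v >> 16);
	buf[3] = (uint8_t)(v >> 24);
}

/* Write v to buf in big-endian byte order, as used for imm_data. */
static void put_be32(uint8_t buf[4], uint32_t v)
{
	assert(buf);
	buf[0] = (uint8_t)(v >> 24);
	buf[1] = (uint8_t)(v >> 16);
	buf[2] = (uint8_t)(v >> 8);
	buf[3] = (uint8_t)v;
}
```

On a little-endian CPU the bytes produced by `put_le32()` match the in-memory layout of the value, which is why the little-endian conversions compile to nothing there, while the big-endian field always needs a real swap.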
    • Amir Goldstein's avatar
      ovl: check ERR_PTR() return value from ovl_encode_fh() · 9b6faee0
      Amir Goldstein authored
      Another fix for an issue reported by 0-day robot.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: 8ed5eec9 ("ovl: encode pure upper file handles")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9b6faee0
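The title of the commit above refers to the kernel convention of returning an errno encoded in a pointer. A user-space imitation of that convention is sketched below to show why the return value has to be tested with IS_ERR() rather than against NULL. All `toy_*` names are hypothetical, and `toy_encode_fh()` is not the real ovl_encode_fh().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(). */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical encoder: returns a handle, or an encoded errno on failure. */
static void *toy_encode_fh(int fail)
{
	static char handle[] = "fh";

	return fail ? toy_err_ptr(-ENOMEM) : handle;
}

/* The caller must test with toy_is_err(); a NULL check misses the error. */
static long toy_use_fh(int fail)
{
	void *fh = toy_encode_fh(fail);

	if (toy_is_err(fh))
		return toy_ptr_err(fh);
	assert(fh != NULL);
	return 0;
}
```

An error pointer is non-NULL, so code that only checks for NULL would go on to dereference it.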
    • Amir Goldstein's avatar
      ovl: fix regression in fsnotify of overlay merge dir · 2aed489d
      Amir Goldstein authored
      A refactoring patch in the NFS export series passed the wrong argument
      to ovl_get_inode(), causing a regression in the very recent fix to
      fsnotify of overlay merge dir.

      The regression caused merge directory inodes to be hashed by upper
      instead of lower real inode when NFS export and directory indexing are
      disabled. That caused an inotify watch to become obsolete after directory
      copy up and drop caches.

      LTP test inotify07 was improved to catch this regression.
      The regression also caused multiple redirect dirs to the same origin not
      to be detected on lookup with NFS export disabled. An xfstest was added to
      cover this case.
      
      Fixes: 0aceb53e ("ovl: do not pass overlay dentry to ovl_get_inode()")
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      2aed489d