1. 13 Aug, 2014 21 commits
    • Aneesh Kumar K.V's avatar
      powerpc/thp: Handle combo pages in invalidate · fc047955
      Aneesh Kumar K.V authored
      If we changed base page size of the segment, either via sub_page_protect
      or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
      table entries. We do a lazy hash page table flush for all mapped pages
      in the demoted segment. This happens when we handle hash page fault for
      these pages.
      
      We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
      pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
      that implies that we could possibly have older 64K hash pte entries in
      the hash page table and we need to invalidate those entries.
      
      Use _PAGE_COMBO to determine the page size with which we should
      invalidate the hash table entries on unmap.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      fc047955
    • Aneesh Kumar K.V's avatar
      powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte · 629149fa
      Aneesh Kumar K.V authored
      If we changed base page size of the segment, either via sub_page_protect
      or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
      table entries. We do a lazy hash page table flush for all mapped pages
      in the demoted segment. This happens when we handle hash page fault
      for these pages.
      
      We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
      pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
      that implies that we could possibly have older 64K hash pte entries in
      the hash page table and we need to invalidate those entries.
      
      Handle this correctly for 16M pages
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      629149fa
    • Aneesh Kumar K.V's avatar
      powerpc/thp: Don't recompute vsid and ssize in loop on invalidate · fa1f8ae8
      Aneesh Kumar K.V authored
      The segment identifier and segment size will remain the same in
      the loop, So we can compute it outside. We also change the
      hugepage_invalidate interface so that we can use it the later patch
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      fa1f8ae8
    • Aneesh Kumar K.V's avatar
      powerpc/thp: Add write barrier after updating the valid bit · b0aa44a3
      Aneesh Kumar K.V authored
      With hugepages, we store the hpte valid information in the pte page
      whose address is stored in the second half of the PMD. Use a
      write barrier to make sure clearing pmd busy bit and updating
      hpte valid info are ordered properly.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b0aa44a3
    • Nishanth Aravamudan's avatar
      powerpc: reorder per-cpu NUMA information's initialization · 2fabf084
      Nishanth Aravamudan authored
      There is an issue currently where NUMA information is used on powerpc
      (and possibly ia64) before it has been read from the device-tree, which
      leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
      
      NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
      after start_secondary(), similar to ia64, which is invoked via
      smp_init().
      
      Commit 6ee0578b ("workqueue: mark init_workqueues() as
      early_initcall()") made init_workqueues() be invoked via
      do_pre_smp_initcalls(), which is obviously before the secondary
      processors are online.
      
      Additionally, the following commits changed init_workqueues() to use
      cpu_to_node to determine the node to use for kthread_create_on_node:
      
      bce90380 ("workqueue: add wq_numa_tbl_len and
      wq_numa_possible_cpumask[]")
      f3f90ad4 ("workqueue: determine NUMA node of workers accourding to
      the allowed cpumask")
      
      Therefore, when init_workqueues() runs, it sees all CPUs as being on
      Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
      a high number of slab deactivations
      (http://www.spinics.net/lists/linux-mm/msg67489.html).
      
      Fix this by initializing the powerpc-specific CPU<->node/local memory
      node mapping as early as possible, which on powerpc is
      do_init_bootmem(). Currently that function initializes the mapping for
      the boot CPU, but we extend it to setup the mapping for all possible
      CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
      per-cpu values for all possible CPUs. That ensures that before the
      early_initcalls run (and really as early as possible), the per-cpu NUMA
      mapping is accurate.
      
      While testing memoryless nodes on PowerKVM guests with a fix to the
      workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
      guest topology of:
      
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
      node 1 size: 16336 MB
      node 1 free: 15329 MB
      node distances:
      node   0   1
        0:  10  40
        1:  40  10
      
      the slab consumption decreases from
      
      Slab:             932416 kB
      SUnreclaim:       902336 kB
      
      to
      
      Slab:             395264 kB
      SUnreclaim:       359424 kB
      
      And we a corresponding increase in the slab efficiency from
      
      slab                                   mem     objs    slabs
                                            used   active   active
      ------------------------------------------------------------
      kmalloc-16384                       337 MB   11.28%  100.00%
      task_struct                         288 MB    9.93%  100.00%
      
      to
      
      slab                                   mem     objs    slabs
                                            used   active   active
      ------------------------------------------------------------
      kmalloc-16384                        37 MB  100.00%  100.00%
      task_struct                          31 MB  100.00%  100.00%
      
      Powerpc didn't support memoryless nodes until recently (64bb80d8
      "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261
      "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
      helped improve memory consumption with these kind of environments.
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2fabf084
    • Himangi Saraogi's avatar
      powerpc/perf/hv-24x7: Use kmem_cache_free · d6589722
      Himangi Saraogi authored
      Free memory allocated using kmem_cache_zalloc using kmem_cache_free
      rather than kfree.
      
      The Coccinelle semantic patch that makes this change is as follows:
      
      // <smpl>
      @@
      expression x,E,c;
      @@
      
       x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
       ... when != x = E
           when != &x
      ?-kfree(x)
      +kmem_cache_free(c,x)
      // </smpl>
      Signed-off-by: default avatarHimangi Saraogi <himangi774@gmail.com>
      Acked-by: default avatarJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d6589722
    • Thomas Falcon's avatar
      powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info · 587870e8
      Thomas Falcon authored
      A buffer returned by H_VTERM_PARTNER_INFO contains device information
      in big endian format, causing problems for little endian architectures.
      This patch ensures that they are in cpu endian.
      Signed-off-by: default avatarThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      587870e8
    • Anton Blanchard's avatar
      powerpc: Hard disable interrupts in xmon · a71d64b4
      Anton Blanchard authored
      xmon only soft disables interrupts. This seems like a bad idea - we
      certainly don't want decrementer and PMU exceptions going off when
      we are debugging something inside xmon.
      
      This issue was uncovered when the hard lockup detector went off
      inside xmon. To ensure we wont get a spurious hard lockup warning,
      I also call touch_nmi_watchdog() when exiting xmon.
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      a71d64b4
    • Nishanth Aravamudan's avatar
      powerpc: remove duplicate definition of TEXASR_FS · 56758e3c
      Nishanth Aravamudan authored
      It appears that commits 7f06f21d ("powerpc/tm: Add checking to
      treclaim/trechkpt") and e4e38121 ("KVM: PPC: Book3S HV: Add
      transactional memory support") both added definitions of TEXASR_FS.
      Remove one of them. At the same time, fix the alignment of the remaining
      definition (should be tab-separated like the rest of the #defines).
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      56758e3c
    • Gavin Shan's avatar
      powerpc/pseries: Avoid deadlock on removing ddw · 5efbabe0
      Gavin Shan authored
      Function remove_ddw() could be called in of_reconfig_notifier and
      we potentially remove the dynamic DMA window property, which invokes
      of_reconfig_notifier again. Eventually, it leads to the deadlock as
      following backtrace shows.
      
      The patch fixes the above issue by deferring releasing the dynamic
      DMA window property while releasing the device node.
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.16.0+ #428 Tainted: G        W
      ---------------------------------------------
      drmgr/2273 is trying to acquire lock:
       ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
       .__blocking_notifier_call_chain+0x40/0x78
      
      but task is already holding lock:
       ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
       .__blocking_notifier_call_chain+0x40/0x78
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock((of_reconfig_chain).rwsem);
        lock((of_reconfig_chain).rwsem);
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      2 locks held by drmgr/2273:
       #0:  (sb_writers#4){.+.+.+}, at: [<c0000000001cbe70>] \
            .vfs_write+0xb0/0x1f8
       #1:  ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \
            .__blocking_notifier_call_chain+0x40/0x78
      
      stack backtrace:
      CPU: 17 PID: 2273 Comm: drmgr Tainted: G        W     3.16.0+ #428
      Call Trace:
      [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
      [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c
      [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68
      [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104
      [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90
      [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78
      [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
      [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
      [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54
      [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4
      [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168
      [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0
      [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4
      [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78
      [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
      [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
      [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc
      [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688
      [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
      [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
      [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
      [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5efbabe0
    • Gavin Shan's avatar
      powerpc/pseries: Failure on removing device node · f1b3929c
      Gavin Shan authored
      While running command "drmgr -c phb -r -s 'PHB 528'", following
      backtrace jumped out because the target device node isn't marked
      with OF_DETACHED by of_detach_node(), which caused by error
      returned from memory hotplug related reconfig notifier when
      disabling CONFIG_MEMORY_HOTREMOVE. The patch fixes it.
      
      ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0
      CPU: 14 PID: 2252 Comm: drmgr Tainted: G        W     3.16.0+ #427
      Call Trace:
      [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
      [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c
      [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0
      [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8
      [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78
      [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34
      [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70
      [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c
      [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688
      [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
      [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
      [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
      [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f1b3929c
    • Benjamin Herrenschmidt's avatar
      powerpc/boot: Use correct zlib types for comparison · ce8f150a
      Benjamin Herrenschmidt authored
      Avoids this warning:
      
      arch/powerpc/boot/gunzip_util.c:118:9: warning: comparison of distinct pointer types lacks a cast
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ce8f150a
    • Vasant Hegde's avatar
      powerpc/powernv: Interface to register/unregister opal dump region · b09c2ec4
      Vasant Hegde authored
      PowerNV platform is capable of capturing host memory region when system
      crashes (because of host/firmware). We have new OPAL API to register/
      unregister memory region to be captured when system crashes.
      
      This patch adds support for new API. Also during boot time we register
      kernel log buffer and unregister before doing kexec.
      Signed-off-by: default avatarVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b09c2ec4
    • Vasant Hegde's avatar
      printk: Add function to return log buffer address and size · 14c4000a
      Vasant Hegde authored
      Platforms like IBM Power Systems supports service processor
      assisted dump. It provides interface to add memory region to
      be captured when system is crashed.
      
      During initialization/running we can add kernel memory region
      to be collected.
      
      Presently we don't have a way to get the log buffer base address
      and size. This patch adds support to return log buffer address
      and size.
      Signed-off-by: default avatarVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      14c4000a
    • Michael Ellerman's avatar
      powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS · 3609e09f
      Michael Ellerman authored
      We have been a bit slack about updating the CPU_FTRS_POSSIBLE and
      CPU_FTRS_ALWAYS masks. When we added POWER8, and also POWER8E we forgot
      to update the ALWAYS mask. And when we added POWER8_DD1 we forgot to
      update both the POSSIBLE and ALWAYS masks.
      
      Luckily this hasn't caused any actual bugs AFAICS. Failing to update the
      ALWAYS mask just forgoes a potential optimisation opportunity. Failing
      to update the POSSIBLE mask for POWER8_DD1 is also OK because it only
      removes a bit rather than adding any.
      
      Regardless they should all be in both masks so as to avoid any future
      bugs when the set of ALWAYS/POSSIBLE bits changes, or the masks
      themselves change.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarMichael Neuling <mikey@neuling.org>
      Acked-by: default avatarJoel Stanley <joel@jms.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      3609e09f
    • Alistair Popple's avatar
      powerpc/ppc476: Disable BTAC · 97b3be1e
      Alistair Popple authored
      This patch disables the branch target address CAM which under specific
      circumstances may cause the processor to skip execution of 1-4
      instructions. This fixes IBM Erratum #47.
      Signed-off-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      97b3be1e
    • Gavin Shan's avatar
      powerpc/powernv: Fix IOMMU group lost · 763fe0ad
      Gavin Shan authored
      When we take full hotplug to recover from EEH errors, PCI buses
      could be involved. For the case, the child devices of involved
      PCI buses can't be attached to IOMMU group properly, which is
      caused by commit 3f28c5af ("powerpc/powernv: Reduce multi-hit of
      iommu_add_device()").
      
      When adding the PCI devices of the newly created PCI buses to
      the system, the IOMMU group is expected to be added in (C).
      (A) fails to bind the IOMMU group because bus->is_added is
      false. (B) fails because the device doesn't have binding IOMMU
      table yet. bus->is_added is set to true at end of (C) and
      pdev->is_added is set to true at (D).
      
         pcibios_add_pci_devices()
            pci_scan_bridge()
               pci_scan_child_bus()
                  pci_scan_slot()
                     pci_scan_single_device()
                        pci_scan_device()
                        pci_device_add()
                           pcibios_add_device()           A: Ignore
                           device_add()                   B: Ignore
                        pcibios_fixup_bus()
                           pcibios_setup_bus_devices()
                              pcibios_setup_device()      C: Hit
            pcibios_finish_adding_to_bus()
               pci_bus_add_devices()
                  pci_bus_add_device()                    D: Add device
      
      If the parent PCI bus isn't involved in hotplug, the IOMMU
      group is expected to be bound in (B). (A) should fail as the
      sysfs entries aren't populated.
      
      The patch fixes the issue by reverting commit 3f28c5af and remove
      WARN_ON() in iommu_add_device() to allow calling the function
      even the specified device already has associated IOMMU group.
      
      Cc: <stable@vger.kernel.org>  # 3.16+
      Reported-by: default avatarThadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: default avatarWei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      763fe0ad
    • Michael Ellerman's avatar
      powerpc: Add smp_mb()s to arch_spin_unlock_wait() · 78e05b14
      Michael Ellerman authored
      Similar to the previous commit which described why we need to add a
      barrier to arch_spin_is_locked(), we have a similar problem with
      spin_unlock_wait().
      
      We need a barrier on entry to ensure any spinlock we have previously
      taken is visibly locked prior to the load of lock->slock.
      
      It's also not clear if spin_unlock_wait() is intended to have ACQUIRE
      semantics. For now be conservative and add a barrier on exit to give it
      ACQUIRE semantics.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      78e05b14
    • Michael Ellerman's avatar
      powerpc: Add smp_mb() to arch_spin_is_locked() · 51d7d520
      Michael Ellerman authored
      The kernel defines the function spin_is_locked(), which can be used to
      check if a spinlock is currently locked.
      
      Using spin_is_locked() on a lock you don't hold is obviously racy. That
      is, even though you may observe that the lock is unlocked, it may become
      locked at any time.
      
      There is (at least) one exception to that, which is if two locks are
      used as a pair, and the holder of each checks the status of the other
      before doing any update.
      
      Assuming *A and *B are two locks, and *COUNTER is a shared non-atomic
      value:
      
      The first CPU does:
      
      	spin_lock(*A)
      
      	if spin_is_locked(*B)
      		# nothing
      	else
      		smp_mb()
      		LOAD r = *COUNTER
      		r++
      		STORE *COUNTER = r
      
      	spin_unlock(*A)
      
      And the second CPU does:
      
      	spin_lock(*B)
      
      	if spin_is_locked(*A)
      		# nothing
      	else
      		smp_mb()
      		LOAD r = *COUNTER
      		r++
      		STORE *COUNTER = r
      
      	spin_unlock(*B)
      
      Although this is a strange locking construct, it should work.
      
      It seems to be understood, but not documented, that spin_is_locked() is
      not a memory barrier, so in the examples above and below the caller
      inserts its own memory barrier before acting on the result of
      spin_is_locked().
      
      For now we assume spin_is_locked() is implemented as below, and we break
      it out in our examples:
      
      	bool spin_is_locked(*LOCK) {
      		LOAD l = *LOCK
      		return l.locked
      	}
      
      Our intuition is that there should be no problem even if the two code
      sequences run simultaneously such as:
      
      	CPU 0			CPU 1
      	==================================================
      	spin_lock(*A)		spin_lock(*B)
      	LOAD b = *B		LOAD a = *A
      	if b.locked # true	if a.locked # true
      	# nothing		# nothing
      	spin_unlock(*A)		spin_unlock(*B)
      
      If one CPU gets the lock before the other then it will do the update and
      the other CPU will back off:
      
      	CPU 0			CPU 1
      	==================================================
      	spin_lock(*A)
      	LOAD b = *B
      				spin_lock(*B)
      	if b.locked # false	LOAD a = *A
      	else			if a.locked # true
      	smp_mb()		# nothing
      	LOAD r1 = *COUNTER	spin_unlock(*B)
      	r1++
      	STORE *COUNTER = r1
      	spin_unlock(*A)
      
      However in reality spin_lock() itself is not indivisible. On powerpc we
      implement it as a load-and-reserve and store-conditional.
      
      Ignoring the retry logic for the lost reservation case, it boils down to:
      	spin_lock(*LOCK) {
      		LOAD l = *LOCK
      		l.locked = true
      		STORE *LOCK = l
      		ACQUIRE_BARRIER
      	}
      
      The ACQUIRE_BARRIER is required to give spin_lock() ACQUIRE semantics as
      defined in memory-barriers.txt:
      
           This acts as a one-way permeable barrier.  It guarantees that all
           memory operations after the ACQUIRE operation will appear to happen
           after the ACQUIRE operation with respect to the other components of
           the system.
      
      On modern powerpc systems we use lwsync for ACQUIRE_BARRIER. lwsync is
      also know as "lightweight sync", or "sync 1".
      
      As described in Power ISA v2.07 section B.2.1.1, in this scenario the
      lwsync is not the barrier itself. It instead causes the LOAD of *LOCK to
      act as the barrier, preventing any loads or stores in the locked region
      from occurring prior to the load of *LOCK.
      
      Whether this behaviour is in accordance with the definition of ACQUIRE
      semantics in memory-barriers.txt is open to discussion, we may switch to
      a different barrier in future.
      
      What this means in practice is that the following can occur:
      
      	CPU 0			CPU 1
      	==================================================
      	LOAD a = *A 		LOAD b = *B
      	a.locked = true		b.locked = true
      	LOAD b = *B		LOAD a = *A
      	STORE *A = a		STORE *B = b
      	if b.locked # false	if a.locked # false
      	else			else
      	smp_mb()		smp_mb()
      	LOAD r1 = *COUNTER	LOAD r2 = *COUNTER
      	r1++			r2++
      	STORE *COUNTER = r1
      				STORE *COUNTER = r2	# Lost update
      	spin_unlock(*A)		spin_unlock(*B)
      
      That is, the load of *B can occur prior to the store that makes *A
      visibly locked. And similarly for CPU 1. The result is both CPUs hold
      their lock and believe the other lock is unlocked.
      
      The easiest fix for this is to add a full memory barrier to the start of
      spin_is_locked(), so adding to our previous definition would give us:
      
      	bool spin_is_locked(*LOCK) {
      		smp_mb()
      		LOAD l = *LOCK
      		return l.locked
      	}
      
      The new barrier orders the store to the lock we are locking vs the load
      of the other lock:
      
      	CPU 0			CPU 1
      	==================================================
      	LOAD a = *A 		LOAD b = *B
      	a.locked = true		b.locked = true
      	STORE *A = a		STORE *B = b
      	smp_mb()		smp_mb()
      	LOAD b = *B		LOAD a = *A
      	if b.locked # true	if a.locked # true
      	# nothing		# nothing
      	spin_unlock(*A)		spin_unlock(*B)
      
      Although the above example is theoretical, there is code similar to this
      example in sem_lock() in ipc/sem.c. This commit in addition to the next
      commit appears to be a fix for crashes we are seeing in that code where
      we believe this race happens in practice.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      51d7d520
    • Guenter Roeck's avatar
      powerpc: Fix "attempt to move .org backwards" error · 11d54904
      Guenter Roeck authored
      Once again, we see
      
      arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
      arch/powerpc/kernel/exceptions-64s.S:865: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:866: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:890: Error: attempt to move .org backwards
      
      when compiling ppc:allmodconfig.
      
      This time the problem has been caused by to commit 0869b6fd
      ("powerpc/book3s: Add basic infrastructure to handle HMI in Linux"),
      which adds functions hmi_exception_early and hmi_exception_after_realmode
      into a critical (size-limited) code area, even though that does not appear
      to be necessary.
      
      Move those functions to a non-critical area of the file.
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      11d54904
    • Scott Wood's avatar
      powerpc/nohash: Split __early_init_mmu() into boot and secondary · 5d61a217
      Scott Wood authored
      __early_init_mmu() does some things that are really only needed by the
      boot cpu.  On FSL booke, This includes calling
      memblock_enforce_memory_limit(), which is labelled __init.  Secondary
      cpu init code can't be __init as that would break CPU hotplug.
      
      While it's probably a bug that memblock_enforce_memory_limit() isn't
      __init_memblock instead, there's no reason why we should be doing this
      stuff for secondary cpus in the first place.
      Signed-off-by: default avatarScott Wood <scottwood@freescale.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      5d61a217
  2. 10 Aug, 2014 6 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/olof/chrome-platform · 58d08e3b
      Linus Torvalds authored
      Pull chrome platform updates from Olof Johansson:
       "Updates to the Chromebook/box platform drivers:
      
         - a bugfix to pstore registration that makes it also work on
           non-Google systems
         - addition of new shipped Chromebooks (later models have more probing
           through ACPI so the need for these updates will be less over time).
         - A couple of minor coding style updates"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/olof/chrome-platform:
        platform/chrome: chromeos_laptop - Add a limit for deferred retries
        platform/chrome: Add support for the acer c720p touchscreen.
        platform/chrome: pstore: fix dmi table to match all chrome systems
        platform/chrome: coding style fixes
        platform/chrome: chromeos_laptop - Add Toshiba CB35 Touch
        platform/chrome: chromeos_laptop - Add Dell Chromebook 11 touch
        platform/chrome: chromeos_laptop - Add HP Chromebook 14
        platform/chrome: chromeos_laptop - Add support for Acer C720
      58d08e3b
    • Linus Torvalds's avatar
      Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 64e3bbc7
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       - a short branch of OMAP fixes that we didn't merge before the window
         opened.
       - a small cleanup that sorts the rk3288 dts entries properly
       - a build fix due to a reference to a removed DT node on exynos
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: dts: exynos5420: remove disp_pd
        ARM: EXYNOS: Fix suspend/resume sequences
        ARM: dts: Fix the sort ordering of EHCI and HSIC in rk3288.dtsi
        ARM: OMAP3: Fix coding style problems in arch/arm/mach-omap2/control.c
        ARM: OMAP3: Fix choice of omap3_restore_es function in OMAP34XX rev3.1.2 case.
        ARM: OMAP2+: clock: allow omap2_dpll_round_rate() to round to next-lowest rate
      64e3bbc7
    • Linus Torvalds's avatar
      Merge branch 'linux-3.17' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 · 91384758
      Linus Torvalds authored
      Pull nouveau drm updates from Ben Skeggs:
       "Apologies for not getting this done in time for Dave's drm-next merge
        window.  As he mentioned, a pre-existing bug reared its head a lot
        more obviously after this lot of changes.  It took quite a bit of time
        to track it down.  In any case, Dave suggested I try my luck by
        sending directly to you this time.
      
        Overview:
      
         - more code for Tegra GK20A from NVIDIA - probing, reclockig
         - better fix for Kepler GPUs that have the graphics engine powered
           off on startup, method courtesy of info provided by NVIDIA
         - unhardcoding of a bunch of graphics engine setup on
           Fermi/Kepler/Maxwell, will hopefully solve some issues people have
           noticed on higher-end models
         - support for "Zero Bandwidth Clear" on Fermi/Kepler/Maxwell, needs
           userspace support in general, but some lucky apps will benefit
           automagically
         - reviewed/exposed the full object APIs to userspace (finally), gives
           it access to perfctrs, ZBC controls, various events.  More to come
           in the future.
         - various other fixes"
      Acked-by: default avatarDave Airlie <airlied@redhat.com>
      
      * 'linux-3.17' of git://anongit.freedesktop.org/git/nouveau/linux-2.6: (87 commits)
        drm/nouveau: expose the full object/event interfaces to userspace
        drm/nouveau: fix headless mode
        drm/nouveau: hide sysfs pstate file behind an option again
        drm/nv50/disp: shhh compiler
        drm/gf100-/gr: implement the proper SetShaderExceptions method
        drm/gf100-/gr: remove some broken ltc bashing, for now
        drm/gf100-/gr: unhardcode attribute cb config
        drm/gf100-/gr: fetch tpcs-per-ppc info on startup
        drm/gf100-/gr: unhardcode pagepool config
        drm/gf100-/gr: unhardcode bundle cb config
        drm/gf100-/gr: improve initial context patch list helpers
        drm/gf100-/gr: add support for zero bandwidth clear
        drm/nouveau/ltc: add zbc drivers
        drm/nouveau/ltc: s/ltcg/ltc/ + cleanup
        drm/nouveau: use ram info from nvif_device
        drm/nouveau/disp: implement nvif event sources for vblank/connector notifiers
        drm/nouveau/disp: allow user direct access to channel control registers
        drm/nouveau/disp: audit and version display classes
        drm/nouveau/disp: audit and version SCANOUTPOS method
        drm/nv50-/disp: audit and version PIOR_PWR method
        ...
      91384758
    • Linus Torvalds's avatar
      Merge tag 'trace-ipi-tracepoints' of... · c23190c0
      Linus Torvalds authored
      Merge tag 'trace-ipi-tracepoints' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
      
      Pull IPI tracepoints for ARM from Steven Rostedt:
       "Nicolas Pitre added generic tracepoints for tracing IPIs and updated
        the arm and arm64 architectures.  It required some minor updates to
        the generic tracepoint system, so it had to wait for me to implement
        them"
      
      * tag 'trace-ipi-tracepoints' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ARM64: add IPI tracepoints
        ARM: add IPI tracepoints
        tracepoint: add generic tracepoint definitions for IPI tracing
        tracing: Do not do anything special with tracepoint_string when tracing is disabled
      c23190c0
    • Linus Torvalds's avatar
      Merge tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · fc335c1b
      Linus Torvalds authored
      Pull trace file read iterator fixes from Steven Rostedt:
       "This contains a fix for two long standing bugs.  Both of which are
        rarely ever hit, and requires the user to do something that users
        rarely do.  It took a few special test cases to even trigger this bug,
        and one of them was just one test in the process of finishing up as
        another one started.
      
        Both bugs have to do with the ring buffer iterator rb_iter_peek(), but
        one is more indirect than the other.
      
        The fist bug fix is simply an increase in the safety net loop counter.
        The counter makes sure that the rb_iter_peek() only iterates the
        number of times we expect it can, and no more.  Well, there was one
        way it could iterate one more than we expected, and that caused the
        ring buffer to shutdown with a nasty warning.  The fix was simply to
        up that counter by one.
      
        The other bug has to be with rb_iter_reset() (called by
        rb_iter_peek()).  This happens when a user reads both the trace_pipe
        and trace files.  The trace_pipe is a consuming read and does not use
        the ring buffer iterator, but the trace file is not a consuming read
        and does use the ring buffer iterator.  When the trace file is being
        read, if it detects that a consuming read occurred, it resets the
        iterator and starts over.  But the reset code that does this
        (rb_iter_reset()), checks if the reader_page is linked to the ring
        buffer or not, and will look into the ring buffer itself if it is not.
        This is wrong, as it should always try to read the reader page first.
        Not to mention, the code that looked into the ring buffer did it
        wrong, and used the header_page "read" offset to start reading on that
        page.  That offset is bogus for pages in the writable ring buffer, and
        was corrupting the iterator, and it would start returning bogus
        events"
      
      * tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ring-buffer: Always reset iterator to reader page
        ring-buffer: Up rb_iter_peek() loop count to 3
      fc335c1b
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · 77e40aae
      Linus Torvalds authored
      Pull namespace updates from Eric Biederman:
       "This is a bunch of small changes built against 3.16-rc6.  The most
        significant change for users is the first patch which makes setns
        drmatically faster by removing unneded rcu handling.
      
        The next chunk of changes are so that "mount -o remount,.." will not
        allow the user namespace root to drop flags on a mount set by the
        system wide root.  Aks this forces read-only mounts to stay read-only,
        no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
        mounts to stay no exec and it prevents unprivileged users from messing
        with a mounts atime settings.  I have included my test case as the
        last patch in this series so people performing backports can verify
        this change works correctly.
      
        The next change fixes a bug in NFS that was discovered while auditing
        nsproxy users for the first optimization.  Today you can oops the
        kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
        with pid namespaces.  I rebased and fixed the build of the
        !CONFIG_NFS_FS case yesterday when a build bot caught my typo.  Given
        that no one to my knowledge bases anything on my tree fixing the typo
        in place seems more responsible that requiring a typo-fix to be
        backported as well.
      
        The last change is a small semantic cleanup introducing
        /proc/thread-self and pointing /proc/mounts and /proc/net at it.  This
        prevents several kinds of problemantic corner cases.  It is a
        user-visible change so it has a minute chance of causing regressions
        so the change to /proc/mounts and /proc/net are individual one line
        commits that can be trivially reverted.  Unfortunately I lost and
        could not find the email of the original reporter so he is not
        credited.  From at least one perspective this change to /proc/net is a
        refgression fix to allow pthread /proc/net uses that were broken by
        the introduction of the network namespace"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
        proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
        proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
        proc: Implement /proc/thread-self to point at the directory of the current thread
        proc: Have net show up under /proc/<tgid>/task/<tid>
        NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
        mnt: Add tests for unprivileged remount cases that have found to be faulty
        mnt: Change the default remount atime from relatime to the existing value
        mnt: Correct permission checks in do_remount
        mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
        mnt: Only change user settable mount flags in remount
        namespaces: Use task_lock and not rcu to protect nsproxy
      77e40aae
  3. 09 Aug, 2014 13 commits
    • Linus Torvalds's avatar
      Merge branch 'stable-3.17' of git://git.infradead.org/users/pcmoore/selinux · 96784de5
      Linus Torvalds authored
      Pull SElinux fixes from Paul Moore:
       "Two small patches to fix a couple of build warnings in SELinux and
        NetLabel.  The patches are obvious enough that I don't think any
        additional explanation is necessary, but it basically boils down to
        the usual: I was stupid, and these patches fix some of the stupid.
      
        Both patches were posted earlier this week to the SELinux list, and
        that is where they sat as I didn't think there were noteworthy enough
        to go upstream at this point in time, but DaveM would rather see them
        upstream now so who am I to argue.  As the patches are both very
        small"
      
      * 'stable-3.17' of git://git.infradead.org/users/pcmoore/selinux:
        selinux: remove unused variabled in the netport, netnode, and netif caches
        netlabel: fix the netlbl_catmap_setlong() dummy function
      96784de5
    • Linus Torvalds's avatar
      Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux · 0d10c2c1
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "This includes a major rewrite of the NFSv4 state code, which has
        always depended on a single mutex.  As an example, open creates are no
        longer serialized, fixing a performance regression on NFSv3->NFSv4
        upgrades.  Thanks to Jeff, Trond, and Benny, and to Christoph for
        review.
      
        Also some RDMA fixes from Chuck Lever and Steve Wise, and
        miscellaneous fixes from Kinglong Mee and others"
      
      * 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
        svcrdma: remove rdma_create_qp() failure recovery logic
        nfsd: add some comments to the nfsd4 object definitions
        nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
        nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
        nfsd: remove nfs4_lock_state: nfs4_laundromat
        nfsd: Remove nfs4_lock_state(): reclaim_complete()
        nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
        nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
        nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
        nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
        nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
        nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
        nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
        nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
        nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
        nfsd: remove old fault injection infrastructure
        nfsd: add more granular locking to *_delegations fault injectors
        nfsd: add more granular locking to forget_openowners fault injector
        nfsd: add more granular locking to forget_locks fault injector
        nfsd: add a list_head arg to nfsd_foreach_client_lock
        ...
      0d10c2c1
    • Linus Torvalds's avatar
      Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 · 023f78b0
      Linus Torvalds authored
      Pull CIFS updates from Steve French:
       "The most visible change in this set is the additional of multi-credit
        support for SMB2/SMB3 which dramatically improves the large file i/o
        performance for these dialects and significantly increases the maximum
        i/o size used on the wire for SMB2/SMB3.
      
        Also reconnection behavior after network failure is improved"
      
      * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (35 commits)
        Add worker function to set allocation size
        [CIFS] Fix incorrect hex vs. decimal in some debug print statements
        update CIFS TODO list
        Add Pavel to contributor list in cifs AUTHORS file
        Update cifs version
        CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
        CIFS: Optimize readpages in a short read case on reconnects
        CIFS: Optimize cifs_user_read() in a short read case on reconnects
        CIFS: Improve indentation in cifs_user_read()
        CIFS: Fix possible buffer corruption in cifs_user_read()
        CIFS: Count got bytes in read_into_pages()
        CIFS: Use separate var for the number of bytes got in async read
        CIFS: Indicate reconnect with ECONNABORTED error code
        CIFS: Use multicredits for SMB 2.1/3 reads
        CIFS: Fix rsize usage for sync read
        CIFS: Fix rsize usage in user read
        CIFS: Separate page reading from user read
        CIFS: Fix rsize usage in readpages
        CIFS: Separate page search from readpages
        CIFS: Use multicredits for SMB 2.1/3 writes
        ...
      023f78b0
    • Ben Skeggs's avatar
    • Ben Skeggs's avatar
      drm/nouveau: fix headless mode · 771fa0e4
      Ben Skeggs authored
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      771fa0e4
    • Ben Skeggs's avatar
      drm/nouveau: hide sysfs pstate file behind an option again · 0d48b58a
      Ben Skeggs authored
      No-one has yet had time to move this to debugfs as discussed during
      the last merge window.  Until this happens, hide the option to make
      it clear it's not going to be here forever.
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      0d48b58a
    • Ben Skeggs's avatar
      drm/nv50/disp: shhh compiler · c354080d
      Ben Skeggs authored
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      c354080d
    • Ben Skeggs's avatar
      drm/gf100-/gr: implement the proper SetShaderExceptions method · d6bd3803
      Ben Skeggs authored
      We have another version of it implemented in SW, however, that version
      isn't serialised with normal PGRAPH operation and can possibly clobber
      the enables for another context.
      
      This is the same method that's implemented by the NVIDIA binary driver.
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      d6bd3803
    • Ben Skeggs's avatar
      drm/gf100-/gr: remove some broken ltc bashing, for now · e8873773
      Ben Skeggs authored
      ... and hope that the defaults are good enough.  This was always
      supposed to be a read/modify/write thing anyway, so we're writing
      very wrong stuff for some boards already.
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      e8873773
    • Ben Skeggs's avatar
      drm/gf100-/gr: unhardcode attribute cb config · 67cfbfdf
      Ben Skeggs authored
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      67cfbfdf
    • Ben Skeggs's avatar
      b81146b0
    • Ben Skeggs's avatar
      drm/gf100-/gr: unhardcode pagepool config · f331a15f
      Ben Skeggs authored
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      f331a15f
    • Ben Skeggs's avatar
      drm/gf100-/gr: unhardcode bundle cb config · aa2d58c3
      Ben Skeggs authored
      Should be the same values as before, except:
      
      GF117 has smaller buffer allocated, as per register setup.
      GK20A now uses values from Tegra driver, not GK104's.
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      aa2d58c3