1. 15 May, 2020 9 commits
    • Michael Ellerman's avatar
      selftests/powerpc: Add a test of counting larx/stcx · 7481cad4
      Michael Ellerman authored
      This is based on the count_instructions test.
      
      However this one also counts the number of failed stcx's, and in
      conjunction with knowing the size of the stcx loop, can calculate the
      total number of instructions executed even in the face of
      non-deterministic stcx failures.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200426114410.3917383-1-mpe@ellerman.id.au
      7481cad4
    • Michael Ellerman's avatar
      powerpc: Drop unneeded cast in task_pt_regs() · 24ac99e9
      Michael Ellerman authored
      There's no need to cast in task_pt_regs() as tsk->thread.regs should
      already be a struct pt_regs. If someone's using task_pt_regs() on
      something that's not a task but happens to have a thread.regs then
      we'll deal with them later.
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200428123152.73566-1-mpe@ellerman.id.au
      24ac99e9
    • Michael Ellerman's avatar
      powerpc/64: Don't initialise init_task->thread.regs · 7ffa8b7d
      Michael Ellerman authored
      Aneesh increased the size of struct pt_regs by 16 bytes and started
      seeing this WARN_ON:
      
        smp: Bringing up secondary CPUs ...
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at arch/powerpc/kernel/process.c:455 giveup_all+0xb4/0x110
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+ #318
        NIP:  c00000000001a2b4 LR: c00000000001a29c CTR: c0000000031d0000
        REGS: c0000000026d3980 TRAP: 0700   Not tainted  (5.7.0-rc2-gcc-8.2.0-1.g8f6a41f-default+)
        MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48048224  XER: 00000000
        CFAR: c000000000019cc8 IRQMASK: 1
        GPR00: c00000000001a264 c0000000026d3c20 c0000000026d7200 800000000280b033
        GPR04: 0000000000000001 0000000000000000 0000000000000077 30206d7372203164
        GPR08: 0000000000002000 0000000002002000 800000000280b033 3230303030303030
        GPR12: 0000000000008800 c0000000031d0000 0000000000800050 0000000002000066
        GPR16: 000000000309a1a0 000000000309a4b0 000000000309a2d8 000000000309a890
        GPR20: 00000000030d0098 c00000000264da40 00000000fd620000 c0000000ff798080
        GPR24: c00000000264edf0 c0000001007469f0 00000000fd620000 c0000000020e5e90
        GPR28: c00000000264edf0 c00000000264d200 000000001db60000 c00000000264d200
        NIP [c00000000001a2b4] giveup_all+0xb4/0x110
        LR [c00000000001a29c] giveup_all+0x9c/0x110
        Call Trace:
        [c0000000026d3c20] [c00000000001a264] giveup_all+0x64/0x110 (unreliable)
        [c0000000026d3c90] [c00000000001ae34] __switch_to+0x104/0x480
        [c0000000026d3cf0] [c000000000e0b8a0] __schedule+0x320/0x970
        [c0000000026d3dd0] [c000000000e0c518] schedule_idle+0x38/0x70
        [c0000000026d3df0] [c00000000019c7c8] do_idle+0x248/0x3f0
        [c0000000026d3e70] [c00000000019cbb8] cpu_startup_entry+0x38/0x40
        [c0000000026d3ea0] [c000000000011bb0] rest_init+0xe0/0xf8
        [c0000000026d3ed0] [c000000002004820] start_kernel+0x990/0x9e0
        [c0000000026d3f90] [c00000000000c49c] start_here_common+0x1c/0x400
      
      Which was unexpected. The warning is checking the thread.regs->msr
      value of the task we are switching from:
      
        usermsr = tsk->thread.regs->msr;
        ...
        WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC)));
      
      ie. if MSR_VSX is set then both of MSR_FP and MSR_VEC are also set.
      
      Dumping tsk->thread.regs->msr we see that it's: 0x1db60000
      
      Which is not a normal looking MSR, in fact the only valid bit is
      MSR_VSX, all the other bits are reserved in the current definition of
      the MSR.
      
      We can see from the oops that it was swapper/0 that we were switching
      from when we hit the warning, ie. init_task. So its thread.regs points
      to the base (high addresses) in init_stack.
      
      Dumping the content of init_task->thread.regs, with the members of
      pt_regs annotated (the 16 bytes larger version), we see:
      
        0000000000000000 c000000002780080    gpr[0]     gpr[1]
        0000000000000000 c000000002666008    gpr[2]     gpr[3]
        c0000000026d3ed0 0000000000000078    gpr[4]     gpr[5]
        c000000000011b68 c000000002780080    gpr[6]     gpr[7]
        0000000000000000 0000000000000000    gpr[8]     gpr[9]
        c0000000026d3f90 0000800000002200    gpr[10]    gpr[11]
        c000000002004820 c0000000026d7200    gpr[12]    gpr[13]
        000000001db60000 c0000000010aabe8    gpr[14]    gpr[15]
        c0000000010aabe8 c0000000010aabe8    gpr[16]    gpr[17]
        c00000000294d598 0000000000000000    gpr[18]    gpr[19]
        0000000000000000 0000000000001ff8    gpr[20]    gpr[21]
        0000000000000000 c00000000206d608    gpr[22]    gpr[23]
        c00000000278e0cc 0000000000000000    gpr[24]    gpr[25]
        000000002fff0000 c000000000000000    gpr[26]    gpr[27]
        0000000002000000 0000000000000028    gpr[28]    gpr[29]
        000000001db60000 0000000004750000    gpr[30]    gpr[31]
        0000000002000000 000000001db60000    nip        msr
        0000000000000000 0000000000000000    orig_r3    ctr
        c00000000000c49c 0000000000000000    link       xer
        0000000000000000 0000000000000000    ccr        softe
        0000000000000000 0000000000000000    trap       dar
        0000000000000000 0000000000000000    dsisr      result
        0000000000000000 0000000000000000    ppr        kuap
        0000000000000000 0000000000000000    pad[2]     pad[3]
      
      This looks suspiciously like stack frames, not a pt_regs. If we look
      closely we can see return addresses from the stack trace above,
      c000000002004820 (start_kernel) and c00000000000c49c (start_here_common).
      
      init_task->thread.regs is setup at build time in processor.h:
      
        #define INIT_THREAD  { \
        	.ksp = INIT_SP, \
        	.regs = (struct pt_regs *)INIT_SP - 1, /* XXX bogus, I think */ \
      
      The early boot code where we setup the initial stack is:
      
        LOAD_REG_ADDR(r3,init_thread_union)
      
        /* set up a stack pointer */
        LOAD_REG_IMMEDIATE(r1,THREAD_SIZE)
        add	r1,r3,r1
        li	r0,0
        stdu	r0,-STACK_FRAME_OVERHEAD(r1)
      
      Which creates a stack frame of size 112 bytes (STACK_FRAME_OVERHEAD).
      Which is far too small to contain a pt_regs.
      
      So the result is init_task->thread.regs is pointing at some stack
      frames on the init stack, not at a pt_regs.
      
      We have gotten away with this for so long because with pt_regs at its
      current size the MSR happens to point into the first frame, at a
      location that is not written to by the early asm. With the 16 byte
      expansion the MSR falls into the second frame, which is used by the
      compiler, and collides with a saved register that tends to be
      non-zero.
      
      As far as I can see this has been wrong since the original merge of
      64-bit ppc support, back in 2002.
      
      Conceptually swapper should have no regs, it never entered from
      userspace, and in fact that's what we do on 32-bit. It's also
      presumably what the "bogus" comment is referring to.
      
      So I think the right fix is to just not-initialise regs at all. I'm
      slightly worried this will break some code that isn't prepared for a
      NULL regs, but we'll have to see.
      
      Remove the comment in head_64.S which refers to us setting up the
      regs (even though we never did), and is otherwise not really accurate
      any more.
      Reported-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200428123130.73078-1-mpe@ellerman.id.au
      7ffa8b7d
    • Gustavo A. R. Silva's avatar
      powerpc/mm: Replace zero-length array with flexible-array · 02bddf21
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being incorrectly/erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get completely rid of those sorts of issues.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200507185755.GA15014@embeddedor
      02bddf21
    • Gustavo A. R. Silva's avatar
      powerpc: Replace zero-length array with flexible-array · 0f6be41c
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being incorrectly/erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get completely rid of those sorts of issues.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200507185749.GA14994@embeddedor
      0f6be41c
    • Nicholas Piggin's avatar
      powerpc: Use trap metadata to prevent double restart rather than zeroing trap · 4e0e45b0
      Nicholas Piggin authored
      It's not very nice to zero trap for this, because then system calls no
      longer have trap_is_syscall(regs) invariant, and we can't distinguish
      between sc and scv system calls (in a later patch).
      
      Take one last unused bit from the low bits of the pt_regs.trap word
      for this instead. There is not a really good reason why it should be
      in trap as opposed to another field, but trap has some concept of
      flags and it exists. Ideally I think we would move trap to 2-byte
      field and have 2 more bytes available independently.
      
      Add a selftests case for this, which can be seen to fail if
      trap_norestart() is changed to return false.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      [mpe: Make them static inlines]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200507121332.2233629-4-mpe@ellerman.id.au
      4e0e45b0
    • Nicholas Piggin's avatar
      powerpc: trap_is_syscall() helper to hide syscall trap number · 912237ea
      Nicholas Piggin authored
      A new system call interrupt will be added with a new trap number.
      Hide the explicit 0xc00 test behind an accessor to reduce churn
      in callers.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      [mpe: Make it a static inline]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200507121332.2233629-3-mpe@ellerman.id.au
      912237ea
    • Nicholas Piggin's avatar
      powerpc: Use set_trap() and avoid open-coding trap masking · db30144b
      Nicholas Piggin authored
      The pt_regs.trap field keeps 4 low bits for some metadata about the
      trap or how it was handled, which is masked off in order to test the
      architectural trap number.
      
      Add a set_trap() accessor to set this, equivalent to TRAP() for
      returning it. This is actually not quite the equivalent of TRAP()
      because it always clears the low bits, which may be harmless if
      it can only be updated via ptrace syscall, but it seems dangerous.
      
      In fact settting TRAP from ptrace doesn't seem like a great idea
      so maybe it's better deleted.
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      [mpe: Make it a static inline rather than a shouty macro]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200507121332.2233629-2-mpe@ellerman.id.au
      db30144b
    • Nicholas Piggin's avatar
  2. 11 May, 2020 10 commits
  3. 06 May, 2020 1 commit
    • Michael Ellerman's avatar
      Merge the lockless page table walk rework into next · 1f12096a
      Michael Ellerman authored
      This merges the lockless page table walk rework series from Aneesh.
      Because it touches powerpc KVM code we are sharing it with the kvm-ppc
      tree in our topic/ppc-kvm branch.
      
      This is the cover letter from Aneesh:
      
      Avoid IPI while updating page table entries.
      
      Problem Summary:
      Slow termination of KVM guest with large guest RAM config due to a
      large number of IPIs that were caused by clearing level 1 PTE
      entries (THP) entries. This is shown in the stack trace below.
      
      - qemu-system-ppc  [kernel.vmlinux]            [k] smp_call_function_many
         - smp_call_function_many
            - 36.09% smp_call_function_many
                 serialize_against_pte_lookup
                 radix__pmdp_huge_get_and_clear
                 zap_huge_pmd
                 unmap_page_range
                 unmap_vmas
                 unmap_region
                 __do_munmap
                 __vm_munmap
                 sys_munmap
                system_call
                 __munmap
                 qemu_ram_munmap
                 qemu_anon_ram_free
                 reclaim_ramblock
                 call_rcu_thread
                 qemu_thread_start
                 start_thread
                 __clone
      
      Why we need to do IPI when clearing PMD entries:
      This was added as part of commit: 13bd817b ("powerpc/thp: Serialize pmd clear against a linux page table walk")
      
      serialize_against_pte_lookup makes sure that all parallel lockless
      page table walk completes before we convert a PMD pte entry to regular
      pmd entry. We end up doing that conversion in the below scenarios
      
      1) __split_huge_zero_page_pmd
      2) do_huge_pmd_wp_page_fallback
      3) MADV_DONTNEED running parallel to page faults.
      
      local_irq_disable and lockless page table walk:
      
      The lockless page table walk work with the assumption that we can
      dereference the page table contents without holding a lock. For this
      to work, we need to make sure we read the page table contents
      atomically and page table pages are not going to be freed/released
      while we are walking the table pages. We can achieve by using a rcu
      based freeing for page table pages or if the architecture implements
      broadcast tlbie, we can block the IPI as we walk the page table pages.
      
      To support both the above framework, lockless page table walk is done
      with irq disabled instead of rcu_read_lock()
      
      We do have two interface for lockless page table walk, gup fast and
      __find_linux_pte. This patch series makes __find_linux_pte table walk
      safe against the conversion of PMD PTE to regular PMD.
      
      gup fast:
      
      gup fast is already safe against THP split because kernel now
      differentiate between a pmd split and a compound page split. gup fast
      can run parallel to a pmd split and we prevent a parallel gup fast to
      a hugepage split, by freezing the page refcount and failing the
      speculative page ref increment.
      
      Similar to how gup is safe against parallel pmd split, this patch
      series updates the __find_linux_pte callers to be safe against a
      parallel pmd split. We do that by enforcing the following rules.
      
      1) Don't reload the pte value, because that can be updated in
         parallel.
      2) Code should be able to work with a stale PTE value and not the
         recent one. ie, the pte value that we are looking at may not be the
         latest value in the page table.
      3) Before looking at pte value check for _PAGE_PTE bit. We now do this
      as part of pte_present() check.
      
      Performance:
      
      This speeds up Qemu guest RAM del/unplug time as below
      128 core, 496GB guest:
      
      Without patch:
        munmap start: timer = 13162 ms, PID=7684
        munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms
      
      With patch (upto removing IPI)
        munmap start: timer = 196449 ms, PID=6681
        munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
      
      With patch (with adding the tlb invalidate in pmdp_huge_get_and_clear_full)
        munmap start: timer = 196345 ms, PID=6879
        munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms
      
      Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
      1f12096a
  4. 05 May, 2020 20 commits