1. 04 May, 2015 11 commits
  2. 03 May, 2015 1 commit
  3. 30 Apr, 2015 28 commits
    • Gerald Schaefer's avatar
      mm/hugetlb: use pmd_page() in follow_huge_pmd() · 43ee6fcf
      Gerald Schaefer authored
      commit 97534127 upstream.
      
      Commit 61f77eda ("mm/hugetlb: reduce arch dependent code around
      follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte
      layout differ and using pte_page() on a huge pmd will return wrong
      results.  Using pmd_page() instead fixes this.
      
      All architectures that were touched by that commit have pmd_page()
      defined, so this should not break anything on other architectures.
      
      Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*"
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>, Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      43ee6fcf
    • Len Brown's avatar
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power... · 257ccf0a
      Len Brown authored
      sched/idle/x86: Restore mwait_idle() to fix boot hangs, to improve power savings and to improve performance
      
      commit b253149b upstream.
      
      In Linux-3.9 we removed the mwait_idle() loop:
      
        69fb3676 ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
      
      The reasoning was that modern machines should be sufficiently
      happy during the boot process using the default_idle() HALT
      loop, until cpuidle loads and either acpi_idle or intel_idle
      invoke the newer MWAIT-with-hints idle loop.
      
      But two machines reported problems:
      
       1. Certain Core2-era machines support MWAIT-C1 and HALT only.
          MWAIT-C1 is preferred for optimal power and performance.
          But if they support just C1, cpuidle never loads and
          so they use the boot-time default idle loop forever.
      
       2. Some laptops will boot-hang if HALT is used,
          but will boot successfully if MWAIT is used.
          This appears to be a hidden assumption in BIOS SMI,
          that is presumably valid on the proprietary OS
          where the BIOS was validated.
      
             https://bugzilla.kernel.org/show_bug.cgi?id=60770
      
      So here we effectively revert the patch above, restoring
      the mwait_idle() loop.  However, we don't bother restoring
      the idle=mwait cmdline parameter, since it appears to add
      no value.
      
      Maintainer notes:
      
        For 3.9, simply revert 69fb3676
        for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
        context For 3.11, 3.12, 3.13, this patch applies cleanly
      Tested-by: default avatarMike Galbraith <bitbucket@online.de>
      Signed-off-by: default avatarLen Brown <len.brown@intel.com>
      Acked-by: default avatarMike Galbraith <bitbucket@online.de>
      Cc: <stable@vger.kernel.org> # 3.9+
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ian Malone <ibmalone@gmail.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
      [ Ported to recent kernels. ]
      [ Mike: 3.10 backport ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      257ccf0a
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after extent_same ioctl · 2034243f
      Filipe Manana authored
      commit 113e8283 upstream.
      
      If we pass a length of 0 to the extent_same ioctl, we end up locking an
      extent range with a start offset greater then its end offset (if the
      destination file's offset is greater than zero). This results in a warning
      from extent_io.c:insert_state through the following call chain:
      
        btrfs_extent_same()
          btrfs_double_lock()
            lock_extent_range()
              lock_extent(inode->io_tree, offset, offset + len - 1)
                lock_extent_bits()
                  __set_extent_bit()
                    insert_state()
                      --> WARN_ON(end < start)
      
      This leads to an infinite loop when evicting the inode. This is the same
      problem that my previous patch titled
      "Btrfs: fix inode eviction infinite loop after cloning into it" addressed
      but for the extent_same ioctl instead of the clone ioctl.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2034243f
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · 4ab0f403
      Filipe Manana authored
      commit ccccf3d6 upstream.
      
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater then the end offset, which triggers immediately the
      following warning:
      
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      (...)
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      (...)
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      (...)
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      (...)
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
      
        mkfs.btrfs -f SCRATCH_DEV
        mount SCRATCH_DEV $SCRATCH_MNT
      
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
      
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4ab0f403
    • David Sterba's avatar
      btrfs: don't accept bare namespace as a valid xattr · a60903bf
      David Sterba authored
      commit 3c3b04d1 upstream.
      
      Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
      works:
      
       $ touch file
       $ setfattr -n user. -v 1 file
       $ getfattr -d file
      user.="1"
      
      ie. the missing attribute name after the namespace.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291Reported-by: default avatarWilliam Douglas <william.douglas@intel.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a60903bf
    • Filipe Manana's avatar
      Btrfs: fix log tree corruption when fs mounted with -o discard · fb5aac18
      Filipe Manana authored
      commit dcc82f47 upstream.
      
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fb5aac18
    • Eric Dumazet's avatar
      net: fix crash in build_skb() · f555cb3a
      Eric Dumazet authored
      [ Upstream commit 2ea2f62c ]
      
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f555cb3a
    • Eric Dumazet's avatar
      net: do not deplete pfmemalloc reserve · bdcfb6f2
      Eric Dumazet authored
      [ Upstream commit 79930f58 ]
      
      build_skb() should look at the page pfmemalloc status.
      If set, this means page allocator allocated this page in the
      expectation it would help to free other pages. Networking
      stack can do that only if skb->pfmemalloc is also set.
      
      Also, we must refrain using high order pages from the pfmemalloc
      reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
      them. Under memory pressure, using order-0 pages is probably the best
      strategy.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bdcfb6f2
    • Eric Dumazet's avatar
      tcp: avoid looping in tcp_send_fin() · 90a176dc
      Eric Dumazet authored
      [ Upstream commit 845704a5 ]
      
      Presence of an unbound loop in tcp_send_fin() had always been hard
      to explain when analyzing crash dumps involving gigantic dying processes
      with millions of sockets.
      
      Lets try a different strategy :
      
      In case of memory pressure, try to add the FIN flag to last packet
      in write queue, even if packet was already sent. TCP stack will
      be able to deliver this FIN after a timeout event. Note that this
      FIN being delivered by a retransmit, it also carries a Push flag
      given our current implementation.
      
      By checking sk_under_memory_pressure(), we anticipate that cooking
      many FIN packets might deplete tcp memory.
      
      In the case we could not allocate a packet, even with __GFP_WAIT
      allocation, then not sending a FIN seems quite reasonable if it allows
      to get rid of this socket, free memory, and not block the process from
      eventually doing other useful work.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      90a176dc
    • Eric Dumazet's avatar
      tcp: fix possible deadlock in tcp_send_fin() · 3644030f
      Eric Dumazet authored
      [ Upstream commit d83769a5 ]
      
      Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
      case a huge process is killed by OOM, and tcp_mem[2] is hit.
      
      To be able to free memory we need to make progress, so this
      patch allows FIN packets to not care about tcp_mem[2], if
      skb allocation succeeded.
      
      In a follow-up patch, we might abort tcp_send_fin() infinite loop
      in case TIF_MEMDIE is set on this thread, as memory allocator
      did its best getting extra memory already.
      
      This patch reverts d22e1537 ("tcp: fix tcp fin memory accounting")
      
      Fixes: d22e1537 ("tcp: fix tcp fin memory accounting")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3644030f
    • Sebastian Pöhn's avatar
      ip_forward: Drop frames with attached skb->sk · 70c3d4ae
      Sebastian Pöhn authored
      [ Upstream commit 2ab95749 ]
      
      Initial discussion was:
      [FYI] xfrm: Don't lookup sk_policy for timewait sockets
      
      Forwarded frames should not have a socket attached. Especially
      tw sockets will lead to panics later-on in the stack.
      
      This was observed with TPROXY assigning a tw socket and broken
      policy routing (misconfigured). As a result frame enters
      forwarding path instead of input. We cannot solve this in
      TPROXY as it cannot know that policy routing is broken.
      
      v2:
      Remove useless comment
      Signed-off-by: default avatarSebastian Poehn <sebastian.poehn@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      70c3d4ae
    • Jiri Slaby's avatar
      Linux 3.12.42 · 6d4397ac
      Jiri Slaby authored
      6d4397ac
    • Christoffer Dall's avatar
      arm/arm64: KVM: Keep elrsr/aisr in sync with software model · b5b655c9
      Christoffer Dall authored
      commit ae705930 upstream.
      
      There is an interesting bug in the vgic code, which manifests itself
      when the KVM run loop has a signal pending or needs a vmid generation
      rollover after having disabled interrupts but before actually switching
      to the guest.
      
      In this case, we flush the vgic as usual, but we sync back the vgic
      state and exit to userspace before entering the guest.  The consequence
      is that we will be syncing the list registers back to the software model
      using the GICH_ELRSR and GICH_EISR from the last execution of the guest,
      potentially overwriting a list register containing an interrupt.
      
      This showed up during migration testing where we would capture a state
      where the VM has masked the arch timer but there were no interrupts,
      resulting in a hung test.
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Reported-by: default avatarAlex Bennee <alex.bennee@linaro.org>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarAlex Bennée <alex.bennee@linaro.org>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b5b655c9
    • Marc Zyngier's avatar
      arm64: KVM: Do not use pgd_index to index stage-2 pgd · e83df6de
      Marc Zyngier authored
      commit 04b8dc85 upstream.
      
      The kernel's pgd_index macro is designed to index a normal, page
      sized array. KVM is a bit diffferent, as we can use concatenated
      pages to have a bigger address space (for example 40bit IPA with
      4kB pages gives us an 8kB PGD.
      
      In the above case, the use of pgd_index will always return an index
      inside the first 4kB, which makes a guest that has memory above
      0x8000000000 rather unhappy, as it spins forever in a page fault,
      whist the host happilly corrupts the lower pgd.
      
      The obvious fix is to get our own kvm_pgd_index that does the right
      thing(tm).
      
      Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
      high address.
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e83df6de
    • Marc Zyngier's avatar
      arm64: KVM: Fix HCR setting for 32bit guests · d54c7a6d
      Marc Zyngier authored
      commit 801f6772 upstream.
      
      Commit b856a591 (arm/arm64: KVM: Reset the HCR on each vcpu
      when resetting the vcpu) moved the init of the HCR register to
      happen later in the init of a vcpu, but left out the fixup
      done in kvm_reset_vcpu when preparing for a 32bit guest.
      
      As a result, the 32bit guest is run as a 64bit guest, but the
      rest of the kernel still manages it as a 32bit. Fun follows.
      
      Moving the fixup to vcpu_reset_hcr solves the problem for good.
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d54c7a6d
    • Marc Zyngier's avatar
      arm64: KVM: Fix TLB invalidation by IPA/VMID · c387a64b
      Marc Zyngier authored
      commit 55e858b7 upstream.
      
      It took about two years for someone to notice that the IPA passed
      to TLBI IPAS2E1IS must be shifted by 12 bits. Clearly our reviewing
      is not as good as it should be...
      
      Paper bag time for me.
      Reported-by: default avatarMario Smarduch <m.smarduch@samsung.com>
      Tested-by: default avatarMario Smarduch <m.smarduch@samsung.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c387a64b
    • Christoffer Dall's avatar
      arm/arm64: KVM: Require in-kernel vgic for the arch timers · fd700e92
      Christoffer Dall authored
      commit 05971120 upstream.
      
      It is curently possible to run a VM with architected timers support
      without creating an in-kernel VGIC, which will result in interrupts from
      the virtual timer going nowhere.
      
      To address this issue, move the architected timers initialization to the
      time when we run a VCPU for the first time, and then only initialize
      (and enable) the architected timers if we have a properly created and
      initialized in-kernel VGIC.
      
      When injecting interrupts from the virtual timer to the vgic, the
      current setup should ensure that this never calls an on-demand init of
      the VGIC, which is the only call path that could return an error from
      kvm_vgic_inject_irq(), so capture the return value and raise a warning
      if there's an error there.
      
      We also change the kvm_timer_init() function from returning an int to be
      a void function, since the function always succeeds.
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fd700e92
    • Christoffer Dall's avatar
      arm/arm64: KVM: Don't allow creating VCPUs after vgic_initialized · 67ffa0e4
      Christoffer Dall authored
      commit 716139df upstream.
      
      When the vgic initializes its internal state it does so based on the
      number of VCPUs available at the time.  If we allow KVM to create more
      VCPUs after the VGIC has been initialized, we are likely to error out in
      unfortunate ways later, perform buffer overflows etc.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarEric Auger <eric.auger@linaro.org>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      67ffa0e4
    • Christoffer Dall's avatar
      arm/arm64: KVM: Introduce stage2_unmap_vm · fc234577
      Christoffer Dall authored
      commit 957db105 upstream.
      
      Introduce a new function to unmap user RAM regions in the stage2 page
      tables.  This is needed on reboot (or when the guest turns off the MMU)
      to ensure we fault in pages again and make the dcache, RAM, and icache
      coherent.
      
      Using unmap_stage2_range for the whole guest physical range does not
      work, because that unmaps IO regions (such as the GIC) which will not be
      recreated or in the best case faulted in on a page-by-page basis.
      
      Call this function on secondary and subsequent calls to the
      KVM_ARM_VCPU_INIT ioctl so that a reset VCPU will detect the guest
      Stage-1 MMU is off when faulting in pages and make the caches coherent.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fc234577
    • Christoffer Dall's avatar
      arm/arm64: KVM: Reset the HCR on each vcpu when resetting the vcpu · f5a186c1
      Christoffer Dall authored
      commit b856a591 upstream.
      
      When userspace resets the vcpu using KVM_ARM_VCPU_INIT, we should also
      reset the HCR, because we now modify the HCR dynamically to
      enable/disable trapping of guest accesses to the VM registers.
      
      This is crucial for reboot of VMs working since otherwise we will not be
      doing the necessary cache maintenance operations when faulting in pages
      with the guest MMU off.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f5a186c1
    • Christoffer Dall's avatar
      arm/arm64: KVM: Correct KVM_ARM_VCPU_INIT power off option · 96c7d3a6
      Christoffer Dall authored
      commit 3ad8b3de upstream.
      
      The implementation of KVM_ARM_VCPU_INIT is currently not doing what
      userspace expects, namely making sure that a vcpu which may have been
      turned off using PSCI is returned to its initial state, which would be
      powered on if userspace does not set the KVM_ARM_VCPU_POWER_OFF flag.
      
      Implement the expected functionality and clarify the ABI.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      96c7d3a6
    • Christoffer Dall's avatar
      arm/arm64: KVM: Don't clear the VCPU_POWER_OFF flag · 9177e8d7
      Christoffer Dall authored
      commit 03f1d4c1 upstream.
      
      If a VCPU was originally started with power off (typically to be brought
      up by PSCI in SMP configurations), there is no need to clear the
      POWER_OFF flag in the kernel, as this flag is only tested during the
      init ioctl itself.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9177e8d7
    • Ard Biesheuvel's avatar
      arm/arm64: kvm: drop inappropriate use of kvm_is_mmio_pfn() · 529ad12b
      Ard Biesheuvel authored
      commit 07a9748c upstream.
      
      Instead of using kvm_is_mmio_pfn() to decide whether a host region
      should be stage 2 mapped with device attributes, add a new static
      function kvm_is_device_pfn() that disregards RAM pages with the
      reserved bit set, as those should usually not be mapped as device
      memory.
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      529ad12b
    • Geoff Levand's avatar
      arm64/kvm: Fix assembler compatibility of macros · 5030e3c0
      Geoff Levand authored
      commit 286fb1cc upstream.
      
      Some of the macros defined in kvm_arm.h are useful in assembly files, but are
      not compatible with the assembler.  Change any C language integer constant
      definitions using appended U, UL, or ULL to the UL() preprocessor macro.  Also,
      add a preprocessor include of the asm/memory.h file which defines the UL()
      macro.
      
      Fixes build errors like these when using kvm_arm.h in assembly
      source files:
      
        Error: unexpected characters following instruction at operand 3 -- `and x0,x1,#((1U<<25)-1)'
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarGeoff Levand <geoff@infradead.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5030e3c0
    • Steve Capper's avatar
      arm: kvm: STRICT_MM_TYPECHECKS fix for user_mem_abort · 8f4ca4f5
      Steve Capper authored
      commit 3d08c629 upstream.
      
      Commit:
      b8865767 ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
      
      introduced some code in user_mem_abort that failed to compile if
      STRICT_MM_TYPECHECKS was enabled.
      
      This patch fixes up the failing comparison.
      Signed-off-by: default avatarSteve Capper <steve.capper@linaro.org>
      Reviewed-by: default avatarKim Phillips <kim.phillips@linaro.org>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8f4ca4f5
    • Christoffer Dall's avatar
      arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE · 91a82540
      Christoffer Dall authored
      commit c3058d5d upstream.
      
      When creating or moving a memslot, make sure the IPA space is within the
      addressable range of the guest.  Otherwise, user space can create too
      large a memslot and KVM would try to access potentially unallocated page
      table entries when inserting entries in the Stage-2 page tables.
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      91a82540
    • Vladimir Murzin's avatar
      arm: kvm: fix CPU hotplug · 296cffc0
      Vladimir Murzin authored
      commit 37a34ac1 upstream.
      
      On some platforms with no power management capabilities, the hotplug
      implementation is allowed to return from a smp_ops.cpu_die() call as a
      function return. Upon a CPU onlining event, the KVM CPU notifier tries
      to reinstall the hyp stub, which fails on platform where no reset took
      place following a hotplug event, with the message:
      
      CPU1: smp_ops.cpu_die() returned, trying to resuscitate
      CPU1: Booted secondary processor
      Kernel panic - not syncing: unexpected prefetch abort in Hyp mode at: 0x80409540
      unexpected data abort in Hyp mode at: 0x80401fe8
      unexpected HVC/SVC trap in Hyp mode at: 0x805c6170
      
      since KVM code is trying to reinstall the stub on a system where it is
      already configured.
      
      To prevent this issue, this patch adds a check in the KVM hotplug
      notifier that detects if the HYP stub really needs re-installing when a
      CPU is onlined and skips the installation call if the stub is already in
      place, which means that the CPU has not been reset.
      Signed-off-by: default avatarVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      296cffc0
    • Joel Schopp's avatar
      arm/arm64: KVM: Fix VTTBR_BADDR_MASK and pgd alloc · 628d3e68
      Joel Schopp authored
      commit dbff124e upstream.
      
      The current aarch64 calculation for VTTBR_BADDR_MASK masks only 39 bits
      and not all the bits in the PA range. This is clearly a bug that
      manifests itself on systems that allocate memory in the higher address
      space range.
      
       [ Modified from Joel's original patch to be based on PHYS_MASK_SHIFT
         instead of a hard-coded value and to move the alignment check of the
         allocation to mmu.c.  Also added a comment explaining why we hardcode
         the IPA range and changed the stage-2 pgd allocation to be based on
         the 40 bit IPA range instead of the maximum possible 48 bit PA range.
         - Christoffer ]
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarJoel Schopp <joel.schopp@amd.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarShannon Zhao <shannon.zhao@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      628d3e68