Commits · 010a6cc4a6431093afec6e1611310f9a0f4433fe · Kirill Smelkov / linux

06 Feb, 2017 26 commits

KVM: MIPS: Make ERET handle ERL before EXL · 010a6cc4

James Hogan authored Oct 25, 2016

commit ede5f3e7 upstream.

The ERET instruction to return from exception is used for returning from
exception level (Status.EXL) and error level (Status.ERL). If both bits
are set however we should be returning from ERL first, as ERL can
interrupt EXL, for example when an NMI is taken. KVM however checks EXL
first.

Fix the order of the checks to match the pseudocode in the instruction
set manual.

Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmÃ¡Å" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

010a6cc4

KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write · a0753a8c

Radim KrÄ�mÃ¡Å� authored Aug 08, 2016

commit dccbfcf5 upstream.

If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
write with vmcs02 as the current VMCS.
This will incorrectly apply modifications intended for vmcs01 to vmcs02
and L2 can use it to gain access to L0's x2APIC registers by disabling
virtualized x2APIC while using msr bitmap that assumes enabled.

Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
current VMCS.  An alternative solution would temporarily make vmcs01 the
current VMCS, but it requires more care.

Fixes: 8d14695f ("x86, apicv: add virtual x2apic support")
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim KrÄmÃ¡Å <rkrcmar@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

a0753a8c

KVM: MIPS: Drop other CPU ASIDs on guest MMU changes · 4c7055f8

James Hogan authored Nov 09, 2016

commit 91e4f1b6 upstream.

When a guest TLB entry is replaced by TLBWI or TLBWR, we only invalidate
TLB entries on the local CPU. This doesn't work correctly on an SMP host
when the guest is migrated to a different physical CPU, as it could pick
up stale TLB mappings from the last time the vCPU ran on that physical
CPU.

Therefore invalidate both user and kernel host ASIDs on other CPUs,
which will cause new ASIDs to be generated when it next runs on those
CPUs.

We're careful only to do this if the TLB entry was already valid, and
only for the kernel ASID where the virtual address it mapped is outside
of the guest user address range.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmÃ¡Å" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-
Cc: Jiri Slaby <jslaby@suse.cz>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

4c7055f8

KVM: MIPS: Precalculate MMIO load resume PC · 1d3b3d31

James Hogan authored Nov 09, 2016

commit e1e575f6 upstream.

The advancing of the PC when completing an MMIO load is done before
re-entering the guest, i.e. before restoring the guest ASID. However if
the load is in a branch delay slot it may need to access guest code to
read the prior branch instruction. This isn't safe in TLB mapped code at
the moment, nor in the future when we'll access unmapped guest segments
using direct user accessors too, as it could read the branch from host
user memory instead.

Therefore calculate the resume PC in advance while we're still in the
right context and save it in the new vcpu->arch.io_pc (replacing the no
longer needed vcpu->arch.pending_load_cause), and restore it on MMIO
completion.

Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmÃ¡Å <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x: 5f508c43: MIPS: KVM: Fix unused variable build warning
Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

1d3b3d31

MIPS: KVM: Fix unused variable build warning · 63b0fa17

Nicholas Mc Guire authored Nov 09, 2016

commit 5f508c43 upstream.

As kvm_mips_complete_mmio_load() did not yet modify PC at this point
as James Hogans <james.hogan@imgtec.com> explained the curr_pc variable
and the comments along with it can be dropped.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Link: http://lkml.org/lkml/2015/5/8/422
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/9993/Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[james.hogan@imgtec.com: Backport to 3.10..3.16]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

63b0fa17

crypto: gcm - Fix IV buffer size in crypto_gcm_setkey · 4d57910e

Ondrej MosnÃ¡Ä�ek authored Sep 23, 2016

commit 50d2e6dc upstream.

The cipher block size for GCM is 16 bytes, and thus the CTR transform
used in crypto_gcm_setkey() will also expect a 16-byte IV. However,
the code currently reserves only 8 bytes for the IV, causing
an out-of-bounds access in the CTR transform. This patch fixes
the issue by setting the size of the IV buffer to 16 bytes.

Fixes: 84c91152 ("[CRYPTO] gcm: Add support for async ciphers")
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>

4d57910e

crypto: skcipher - Fix blkcipher walk OOM crash · 0b3918fc

Herbert Xu authored Oct 27, 2016

commit acdb04d0 upstream.

When we need to allocate a temporary blkcipher_walk_next and it
fails, the code is supposed to take the slow path of processing
the data block by block.  However, due to an unrelated change
we instead end up dereferencing the NULL pointer.

This patch fixes it by moving the unrelated bsize setting out
of the way so that we enter the slow path as inteded.

Fixes: 7607bd8f ("[CRYPTO] blkcipher: Added blkcipher_walk_virt_block")
Cc: stable@vger.kernel.org
Reported-by: xiakaixu <xiakaixu@huawei.com>
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

0b3918fc

crypto: cryptd - initialize child shash_desc on import · 6ff69ad9

Ard Biesheuvel authored Oct 27, 2016

commit 0bd22235 upstream.

When calling .import() on a cryptd ahash_request, the structure members
that describe the child transform in the shash_desc need to be initialized
like they are when calling .init()

Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

6ff69ad9

crypto: algif_skcipher - Load TX SG list after waiting · ca46046c

Herbert Xu authored Oct 27, 2016

commit 4f0414e5 upstream.

We need to load the TX SG list in sendmsg(2) after waiting for
incoming data, not before.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

ca46046c

crypto: algif_skcipher - Fix race condition in skcipher_check_key · 0ac80b6e

Herbert Xu authored Oct 27, 2016

commit 1822793a upstream.

We need to lock the child socket in skcipher_check_key as otherwise
two simultaneous calls can cause the parent socket to be freed.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

0ac80b6e

crypto: algif_hash - Fix race condition in hash_check_key · 54541dc3

Herbert Xu authored Oct 27, 2016

commit ad46d7e3 upstream.

We need to lock the child socket in hash_check_key as otherwise
two simultaneous calls can cause the parent socket to be freed.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

54541dc3

crypto: af_alg - Forbid bind(2) when nokey child sockets are present · fb0b50cf

Herbert Xu authored Oct 27, 2016

commit a6a48c56 upstream.

This patch forbids the calling of bind(2) when there are child
sockets created by accept(2) in existence, even if they are created
on the nokey path.

This is needed as those child sockets have references to the tfm
object which bind(2) will destroy.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

fb0b50cf

crypto: algif_skcipher - Remove custom release parent function · 92954970

Herbert Xu authored Oct 27, 2016

commit d7b65aee upstream.

This patch removes the custom release parent function as the
generic af_alg_release_parent now works for nokey sockets too.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

92954970

crypto: algif_hash - Remove custom release parent function · 023fe7ba

Herbert Xu authored Oct 27, 2016

commit f1d84af1 upstream.

This patch removes the custom release parent function as the
generic af_alg_release_parent now works for nokey sockets too.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

023fe7ba

crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path · 61d05831

Herbert Xu authored Oct 27, 2016

commit 6a935170 upstream.

This patch allows af_alg_release_parent to be called even for
nokey sockets.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

61d05831

crypto: algif_skcipher - Add key check exception for cipher_null · b5b7b2f8

Herbert Xu authored Oct 27, 2016

commit 6e8d8ecf upstream.

This patch adds an exception to the key check so that cipher_null
users may continue to use algif_skcipher without setting a key.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

b5b7b2f8

crypto: skcipher - Add crypto_skcipher_has_setkey · 16d40448

Herbert Xu authored Oct 27, 2016

commit a1383cd8 upstream.

This patch adds a way for skcipher users to determine whether a key
is required by a transform.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

16d40448

crypto: algif_hash - Require setkey before accept(2) · 85529752

Herbert Xu authored Oct 27, 2016

commit 6de62f15 upstream.

Hash implementations that require a key may crash if you use
them without setting a key.  This patch adds the necessary checks
so that if you do attempt to use them without a key that we return
-ENOKEY instead of proceeding.

This patch also adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

85529752

crypto: shash - Fix has_key setting · 389378f7

Herbert Xu authored Oct 27, 2016

commit 00420a65 upstream.

The has_key logic is wrong for shash algorithms as they always
have a setkey function.  So we should instead be testing against
shash_no_setkey.

Fixes: a5596d63 ("crypto: hash - Add crypto_ahash_has_setkey")
Cc: stable@vger.kernel.org
Reported-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

389378f7

crypto: hash - Add crypto_ahash_has_setkey · 7040a842

Herbert Xu authored Oct 27, 2016

commit a5596d63 upstream.

This patch adds a way for ahash users to determine whether a key
is required by a crypto_ahash transform.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

7040a842

crypto: algif_skcipher - Add nokey compatibility path · 4ad8ff67

Herbert Xu authored Oct 27, 2016

commit a0fa2d03 upstream.

This patch adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

4ad8ff67

crypto: af_alg - Add nokey compatibility path · a37d0897

Herbert Xu authored Oct 27, 2016

commit 37766586 upstream.

This patch adds a compatibility path to support old applications
that do acept(2) before setkey.

Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

a37d0897

crypto: af_alg - Disallow bind/setkey/... after accept(2) · 12514d46

Herbert Xu authored Oct 27, 2016

commit c840ac6a upstream.

Each af_alg parent socket obtained by socket(2) corresponds to a
tfm object once bind(2) has succeeded.  An accept(2) call on that
parent socket creates a context which then uses the tfm object.

Therefore as long as any child sockets created by accept(2) exist
the parent socket must not be modified or freed.

This patch guarantees this by using locks and a reference count
on the parent socket.  Any attempt to modify the parent socket will
fail with EBUSY.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

12514d46

crypto: algif_skcipher - Require setkey before accept(2) · b8867191

Herbert Xu authored Oct 27, 2016

commit dd504589 upstream.

Some cipher implementations will crash if you try to use them
without calling setkey first.  This patch adds a check so that
the accept(2) call will fail with -ENOKEY if setkey hasn't been
done on the socket yet.

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

b8867191

sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() · c73d76df

Peter Zijlstra authored Oct 07, 2015

commit ecf7d01c upstream.

Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.

        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>

c73d76df

sched/core: Fix a race between try_to_wake_up() and a woken up task · 6a94ced1

Balbir Singh authored Sep 05, 2016

commit 135e8c92 upstream.

The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>

6a94ced1

21 Oct, 2016 1 commit
- Linux 3.10.104 · 7828a965
  Willy Tarreau authored Oct 21, 2016
  
  7828a965
19 Oct, 2016 13 commits

mm: remove gup_flags FOLL_WRITE games from __get_user_pages() · 9691eac5

Linus Torvalds authored Oct 13, 2016

commit 19be0eaf upstream.

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9 ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f4 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better).  The
s390 dirty bit was implemented in abf09bed ("s390/mm: implement
software dirty bits") which made it into v3.9.  Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
     s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>

9691eac5

xen-netback: ref count shared rings · 4cebb475

Wei Liu authored Jul 11, 2014

... so that we can make sure the rings are not freed until all SKBs in
internal queues are consumed.

1. The VM is receiving packets through bonding + bridge + netback +
   netfront.
2. For some unknown reason at least one packet remains in the rx queue
   and is not delivered to the domU immediately by netback.
3. The VM finishes shutting down.
4. The shared ring between dom0 and domU is freed.
5. then xen-netback continues processing the pending requests and tries
   to put the packet into the now already released shared ring.

> XXXlan0: port 9(vif26.0) entered disabled state
> BUG: unable to handle kernel paging request at ffffc900108641d8
> IP: [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> PGD 57e20067 PUD 57e21067 PMD 571a7067 PTE 0
> Oops: 0000 [#1] SMP
> ...
> CPU: 0 PID: 12587 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Debian 3.10.11-1.58.201405060908
> Hardware name: FUJITSU PRIMERGY BX620 S6/D3051, BIOS 080015 Rev.3C78.3051 07/22/2011
> task: ffff880004b067c0 ti: ffff8800561ec000 task.ti: ffff8800561ec000
> RIP: e030:[<ffffffffa04147dc>]  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> RSP: e02b:ffff8800561edce8  EFLAGS: 00010202
> RAX: ffffc900104adac0 RBX: ffff8800541e95c0 RCX: ffffc90010864000
> RDX: 000000000000003b RSI: 0000000000000000 RDI: ffff880040014380
> RBP: ffff8800570e6800 R08: 0000000000000000 R09: ffff880004799800
> R10: ffffffff813ca115 R11: ffff88005e4fdb08 R12: ffff880054e6f800
> R13: ffff8800561edd58 R14: ffffc900104a1000 R15: 0000000000000000
> FS:  00007f19a54a8700(0000) GS:ffff88005da00000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffffc900108641d8 CR3: 0000000054cb3000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
>  ffff880004b06ba0 0000000000000000 ffff88005da13ec0 ffff88005da13ec0
>  0000000004b067c0 ffffc900104a8ac0 ffffc900104a1020 000000005da13ec0
>  0000000000000000 0000000000000001 ffffc900104a8ac0 ffffc900104adac0
> Call Trace:
>  [<ffffffff813ca32d>] ? _raw_spin_lock_irqsave+0x11/0x2f
>  [<ffffffffa0416033>] ? xen_netbk_kthread+0x174/0x841 [xen_netback]
>  [<ffffffff8105d373>] ? wake_up_bit+0x20/0x20
>  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
>  [<ffffffffa0415ebf>] ? xen_netbk_tx_build_gops+0xce8/0xce8 [xen_netback]
>  [<ffffffff8105ce1e>] ? kthread+0xab/0xb3
>  [<ffffffff81003638>] ? xen_end_context_switch+0xe/0x1c
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
>  [<ffffffff813cfbfc>] ? ret_from_fork+0x7c/0xb0
>  [<ffffffff8105cd73>] ? kthread_freezable_should_stop+0x56/0x56
> Code: 8b b3 d0 00 00 00 48 8b bb d8 00 00 00 0f b7 74 37 02 89 70 08 eb 07 c7 40 08 00 00 00 00 89 d2 c7 40 04 00 00 00 00 48 83 c2 08 <0f> b7 34 d1 89 30 c7 44 24 60 00 00 00 00 8b 44 d1 04 89 44 24
> RIP  [<ffffffffa04147dc>] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
>  RSP <ffff8800561edce8>
> CR2: ffffc900108641d8

Track the shared ring buffer being unmapped and drop those packets.

Ref-count the rings as followed:
  map         -> set to 1
   start_xmit -> inc when queueing SKB to internal queue
   rx_action  -> dec after finishing processing a SKB
  unmap       -> dec and wait to be 0

Note that this is different from ref counting the vif structure itself.
Currently only guest Rx path is taken care of because that's where the
bug surfaced.

This bug doesn't exist in kernel >=3.12 as multi-queue support was added
there.

Link: <https://lists.xenproject.org/archives/html/xen-devel/2014-06/msg00818.html>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Philipp Hahn <hahn@univention.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Tested-by: Philipp Hahn <hahn@univention.de>
Signed-off-by: Willy Tarreau <w@1wt.eu>

4cebb475

security: let security modules use PTRACE_MODE_* with bitmasks · 72f0c114

Jann Horn authored Jan 20, 2016

commit 3dfb7d8c upstream.

It looks like smack and yama weren't aware that the ptrace mode
can have flags ORed into it - PTRACE_MODE_NOAUDIT until now, but
only for /proc/$pid/stat, and with the PTRACE_MODE_*CREDS patch,
all modes have flags ORed into them.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: no smk_ptrace_mode() in 3.10]
Signed-off-by: Willy Tarreau <w@1wt.eu>

72f0c114

MIPS: KVM: Check for pfn noslot case · fea24c07

James Hogan authored Sep 15, 2016

commit ba913e4f upstream.

When mapping a page into the guest we error check using is_error_pfn(),
however this doesn't detect a value of KVM_PFN_NOSLOT, indicating an
error HVA for the page. This can only happen on MIPS right now due to
unusual memslot management (e.g. being moved / removed / resized), or
with an Enhanced Virtual Memory (EVA) configuration where the default
KVM_HVA_ERR_* and kvm_is_error_hva() definitions are unsuitable (fixed
in a later patch). This case will be treated as a pfn of zero, mapping
the first page of physical memory into the guest.

It would appear the MIPS KVM port wasn't updated prior to being merged
(in v3.10) to take commit 81c52c56 ("KVM: do not treat noslot pfn as
a error pfn") into account (merged v3.8), which converted a bunch of
is_error_pfn() calls to is_error_noslot_pfn(). Switch to using
is_error_noslot_pfn() instead to catch this case properly.

Fixes: 858dd5d4 ("KVM/MIPS32: MMU/TLB operations for the Guest.")
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmÃ¡Å" <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[james.hogan@imgtec.com: Backport to v3.16.y]
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

fea24c07

mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED · afac3781

Andrea Arcangeli authored Feb 26, 2016

commit ad33bb04 upstream.

pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessy (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
from pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (for writing not).

While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
tiny window for a race.

Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.

[js] 3.12 backport: no pmd_devmap in 3.12 yet.

[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>

afac3781

ACPI / sysfs: fix error code in get_status() · 40772621

Dan Carpenter authored May 05, 2016

commit f18ebc21 upstream.

The problem with ornamental, do-nothing gotos is that they lead to
"forgot to set the error code" bugs.  We should be returning -EINVAL
here but we don't.  It leads to an uninitalized variable in
counter_show():

    drivers/acpi/sysfs.c:603 counter_show()
    error: uninitialized symbol 'status'.

Fixes: 1c8fce27 (ACPI: introduce drivers/acpi/sysfs.c)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

40772621

staging: comedi: daqboard2000: bug fix board type matching code · 479c12a0

Ian Abbott authored Jun 29, 2016

commit 80e162ee upstream.

`daqboard2000_find_boardinfo()` is supposed to check if the
DaqBoard/2000 series model is supported, based on the PCI subvendor and
subdevice ID.  The current code is wrong as it is comparing the PCI
device's subdevice ID to an expected, fixed value for the subvendor ID.
It should be comparing the PCI device's subvendor ID to this fixed
value.  Correct it.

Fixes: 7e8401b2 ("staging: comedi: daqboard2000: add back
subsystem_device check")
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Cc: <stable@vger.kernel.org> # 3.7+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>

479c12a0

crypto: nx - off by one bug in nx_of_update_msc() · c0a803c1

Dan Carpenter authored Jul 15, 2016

commit e514cc0a upstream.

The props->ap[] array is defined like this:

	struct alg_props ap[NX_MAX_FC][NX_MAX_MODE][3];

So we can see that if msc->fc and msc->mode are == to NX_MAX_FC or
NX_MAX_MODE then we're off by one.

Fixes: ae0222b7 ('powerpc/crypto: nx driver code supporting nx encryption')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Willy Tarreau <w@1wt.eu>

c0a803c1

megaraid_sas: Fix probing cards without io port · 40605334

Yinghai Lu authored Aug 05, 2016

commit e7f85168 upstream.

Found one megaraid_sas HBA probe fails,

[  187.235190] scsi host2: Avago SAS based MegaRAID driver
[  191.112365] megaraid_sas 0000:89:00.0: BAR 0: can't reserve [io  0x0000-0x00ff]
[  191.120548] megaraid_sas 0000:89:00.0: IO memory region busy!

and the card has resource like,
[  125.097714] pci 0000:89:00.0: [1000:005d] type 00 class 0x010400
[  125.104446] pci 0000:89:00.0: reg 0x10: [io  0x0000-0x00ff]
[  125.110686] pci 0000:89:00.0: reg 0x14: [mem 0xce400000-0xce40ffff 64bit]
[  125.118286] pci 0000:89:00.0: reg 0x1c: [mem 0xce300000-0xce3fffff 64bit]
[  125.125891] pci 0000:89:00.0: reg 0x30: [mem 0xce200000-0xce2fffff pref]

that does not io port resource allocated from BIOS, and kernel can not
assign one as io port shortage.

The driver is only looking for MEM, and should not fail.

It turns out megasas_init_fw() etc are using bar index as mask.  index 1
is used as mask 1, so that pci_request_selected_regions() is trying to
request BAR0 instead of BAR1.

Fix all related reference.

Fixes: b6d5d880 ("megaraid_sas: Use lowest memory bar for SR-IOV VF support")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

40605334

aacraid: Check size values after double-fetch from user · 2c0ffdb8

Dave Carroll authored Aug 05, 2016

commit fa00c437 upstream.

In aacraid's ioctl_send_fib() we do two fetches from userspace, one the
get the fib header's size and one for the fib itself. Later we use the
size field from the second fetch to further process the fib. If for some
reason the size from the second fetch is different than from the first
fix, we may encounter an out-of- bounds access in aac_fib_send(). We
also check the sender size to insure it is not out of bounds. This was
reported in https://bugzilla.kernel.org/show_bug.cgi?id=116751 and was
assigned CVE-2016-6480.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Fixes: 7c00ffa3 '[SCSI] 2.6 aacraid: Variable FIB size (updated patch)'
Cc: stable@vger.kernel.org
Signed-off-by: Dave Carroll <david.carroll@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

2c0ffdb8

PCI: Limit config space size for Netronome NFP4000 · af2e2346

Simon Horman authored Dec 11, 2015

commit c2e771b0 upstream.

Like the NFP6000, the NFP4000 as an erratum where reading/writing to PCI
config space addresses above 0x600 can cause the NFP to generate PCIe
completion timeouts.

Limit the NFP4000's PF's config space size to 0x600 bytes as is already
done for the NFP6000.

The NFP4000's VF is 0x6004 (PCI_DEVICE_ID_NETRONOME_NFP6000_VF), the same
device ID as the NFP6000's VF.  Thus, its config space is already limited
by the existing use of quirk_nfp6000().
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

af2e2346

PCI: Add Netronome NFP4000 PF device ID · 0c8b6132

Simon Horman authored Dec 11, 2015

commit 69874ec2 upstream.

Add the device ID for the PF of the NFP4000.  The device ID for the VF,
0x6003, is already present as PCI_DEVICE_ID_NETRONOME_NFP6000_VF.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

0c8b6132

PCI: Limit config space size for Netronome NFP6000 family · b0e39bc1

Jason S. McMullan authored Sep 30, 2015

commit 9f33a2ae upstream.

The NFP6000 has an erratum where reading/writing to PCI config space
addresses above 0x600 can cause the NFP to generate PCIe completion
timeouts.

Limit the NFP6000's config space size to 0x600 bytes.
Signed-off-by: Jason S. McMullan <jason.mcmullan@netronome.com>
[simon: edited changelog]
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

b0e39bc1