1. 07 Oct, 2015 7 commits
    • xhci: rework cycle bit checking for new dequeue pointers · 982fe712
      Mathias Nyman authored
      commit 365038d8 upstream.
      
      When we need to manually move the TR dequeue pointer, we also need to
      set the correct cycle bit. Previously we used the TRB pointer from the
      last received event as a base, but this was changed in
      commit 1f81b6d2 ("usb: xhci: Prefer endpoint context dequeue pointer")
      to use the dequeue pointer from the endpoint context instead.
      
      It turns out some Asmedia controllers advance the dequeue pointer
      stored in the endpoint context past the event triggering TRB, and
      this messed up the way the cycle bit was calculated.
      
      Instead of adding a quirk or complicating the already hard to follow cycle bit
      code, the whole cycle bit calculation is now simplified and adapted to handle
      event and endpoint context dequeue pointer differences.
      
      [js] do not touch find_trb_seg, it is still in use in other portions
           of the code
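      As a rough illustration of the simplified calculation (plain C, not the
      actual xhci driver code or the real xHCI bit layout; the flag values and
      the next pointer are assumptions for the sketch): walk the ring from a
      TRB whose cycle state is known to the target dequeue TRB, and flip the
      expected cycle state every time a link TRB with the toggle-cycle flag is
      crossed.

          #include <stdbool.h>

          #define TRB_FLAG_LINK          (1u << 0)  /* illustrative: TRB is a link TRB */
          #define TRB_FLAG_TOGGLE_CYCLE  (1u << 1)  /* illustrative: link toggles cycle */

          struct trb {
              unsigned int flags;
              struct trb *next;   /* next TRB in ring order, following links */
          };

          static bool cycle_state_at(const struct trb *start, const struct trb *target,
                                     bool start_cycle)
          {
              bool cycle = start_cycle;
              const struct trb *trb;

              for (trb = start; trb != target; trb = trb->next) {
                  /* Crossing a toggling link TRB means the ring wrapped. */
                  if ((trb->flags & TRB_FLAG_LINK) &&
                      (trb->flags & TRB_FLAG_TOGGLE_CYCLE))
                      cycle = !cycle;
              }
              return cycle;
          }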
      
      Fixes: 1f81b6d2 ("usb: xhci: Prefer endpoint context dequeue pointer")
      Reported-by: Maciej Puzio <mx34567@gmail.com>
      Reported-by: Evan Langlois <uudruid74@gmail.com>
      Reviewed-by: Julius Werner <jwerner@chromium.org>
      Tested-by: Maciej Puzio <mx34567@gmail.com>
      Tested-by: Evan Langlois <uudruid74@gmail.com>
      Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      982fe712
    • xhci: Workaround for PME stuck issues in Intel xhci · 19c8c601
      Mathias Nyman authored
      commit b8cb91e0 upstream.
      
      The xhci in Intel Sunrisepoint and Cherryview platforms needs a driver
      workaround for a stuck PME that might either block PME events in suspend,
      or create spurious PME events that prevent runtime suspend.

      The workaround is to clear an internal PME flag, BIT(28), in a
      vendor-specific PMCTRL register at offset 0x80a4, in both the suspend and
      resume callbacks.

      Without this, usb devices connected to the xhci controller might never be
      able to wake the system up from suspend, or might prevent the device from
      going to suspend (xhci D3).
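      As an illustration of the access pattern described above, here is a
      hedged sketch; the register/flag names, and whether the flag is cleared
      by writing a 0 or a 1 to the bit, are assumptions for illustration, not
      the actual upstream quirk code:

          #include <linux/io.h>

          #define XHCI_INTEL_PMCTRL     0x80a4      /* vendor-specific register offset */
          #define XHCI_INTEL_STUCK_PME  (1u << 28)  /* internal PME flag, BIT(28) */

          static void xhci_intel_clear_stuck_pme(void __iomem *mmio_base)
          {
              u32 reg = readl(mmio_base + XHCI_INTEL_PMCTRL);

              reg &= ~XHCI_INTEL_STUCK_PME;             /* clear the stuck PME flag */
              writel(reg, mmio_base + XHCI_INTEL_PMCTRL);
              readl(mmio_base + XHCI_INTEL_PMCTRL);     /* read back to post the write */
          }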
      Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      19c8c601
    • sched: Fix cpu_active_mask/cpu_online_mask race · ff7987cd
      Jan H. Schönherr authored
      commit dd9d3843 upstream.
      
      There is a race condition in SMP bootup code, which may result
      in
      
          WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
          workqueue_cpu_up_callback()
      or
          kernel BUG at kernel/smpboot.c:135!
      
      It can be triggered with a bit of luck in Linux guests running
      on busy hosts.
      
      	CPU0                        CPUn
      	====                        ====
      
      	_cpu_up()
      	  __cpu_up()
      				    start_secondary()
      				      set_cpu_online()
      					cpumask_set_cpu(cpu,
      						   to_cpumask(cpu_online_bits));
      	  cpu_notify(CPU_ONLINE)
      	    <do stuff, see below>
      					cpumask_set_cpu(cpu,
      						   to_cpumask(cpu_active_bits));
      
      During the various CPU_ONLINE callbacks CPUn is online but not
      active. Several things can go wrong at that point, depending on
      the scheduling of tasks on CPU0.
      
      Variant 1:
      
        cpu_notify(CPU_ONLINE)
          workqueue_cpu_up_callback()
            rebind_workers()
              set_cpus_allowed_ptr()
      
        This call fails because it requires an active CPU; rebind_workers()
        ends with a warning:
      
          WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
          workqueue_cpu_up_callback()
      
      Variant 2:
      
        cpu_notify(CPU_ONLINE)
          smpboot_thread_call()
            smpboot_unpark_threads()
             ..
              __kthread_unpark()
                __kthread_bind()
                wake_up_state()
                 ..
                  select_task_rq()
                    select_fallback_rq()
      
        The ->wake_cpu of the unparked thread is not allowed, making a call
        to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
        find an allowed, active CPU and promptly resets the allowed CPUs, so
        that the task in question ends up on CPU0.
      
        When those unparked tasks are eventually executed, they run
        immediately into a BUG:
      
          kernel BUG at kernel/smpboot.c:135!
      
      Just changing the order in which the online/active bits are set
      (and adding some memory barriers), would solve the two issues
      above. However, it would change the order of operations back to
      the one before commit 6acbfb96 ("sched: Fix hotplug vs.
      set_cpus_allowed_ptr()"), thus, reintroducing that particular
      problem.
      
      Going further back into history, we have at least the following
      commits touching this topic:
      - commit 2baab4e9 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
      - commit 5fbd036b ("sched: Cleanup cpu_active madness")
      
      Together, these give us the following non-working solutions:
      
        - secondary CPU sets active before online, because active is assumed to
          be a subset of online;
      
        - secondary CPU sets online before active, because the primary CPU
          assumes that an online CPU is also active;
      
        - secondary CPU sets online and waits for primary CPU to set active,
          because it might deadlock.
      
      Commit 875ebe94 ("powerpc/smp: Wait until secondaries are
      active & online") introduces an arch-specific solution to this
      arch-independent problem.
      
      Now, go for a more general solution without explicit waiting and
      simply set active twice: once on the secondary CPU after online
      was set and once on the primary CPU after online was seen.
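      In outline, the resulting ordering looks like this sketch (a fragment
      using the generic cpumask helpers, not the literal hotplug code):

          /* Secondary CPU, late in its bringup path: */
          set_cpu_online(cpu, true);
          set_cpu_active(cpu, true);      /* 1st set: after publishing online */

          /* Primary CPU, once __cpu_up() observes the secondary: */
          while (!cpu_online(cpu))
              cpu_relax();
          set_cpu_active(cpu, true);      /* 2nd set: after seeing online */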
      
      Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Wilson <msw@amazon.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6acbfb96 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
      Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ff7987cd
    • rcu: Reject memory-order-induced stall-warning false positives · 4438bb21
      Paul E. McKenney authored
      commit 26cdfedf upstream.
      
      If a system is idle from an RCU perspective for longer than specified
      by CONFIG_RCU_CPU_STALL_TIMEOUT, and if one CPU starts a grace period
      just as a second checks for CPU stalls, and if this second CPU happens
      to see the old value of rsp->jiffies_stall, it will incorrectly report a
      CPU stall.  This is quite rare, but apparently occurs deterministically
      on systems with about 6TB of memory.
      
      This commit therefore orders accesses to the data used to determine
      whether or not a CPU stall is in progress.  Grace-period initialization
      and cleanup first increments rsp->completed to mark the end of the
      previous grace period, then records the current jiffies in rsp->gp_start,
      then records the jiffies at which a stall can be expected to occur in
      rsp->jiffies_stall, and finally increments rsp->gpnum to mark the start
      of the new grace period.  Now, this ordering by itself does not prevent
      false positives.  For example, if grace-period initialization was delayed
      between recording rsp->gp_start and rsp->jiffies_stall, the CPU stall
      warning code might still see an old value of rsp->jiffies_stall.
      
      Therefore, this commit also orders the CPU stall warning accesses as
      well, loading rsp->gpnum and jiffies, then rsp->jiffies_stall, then
      rsp->gp_start, and finally rsp->completed.  This ordering means that
      the false-positive scenario in the previous paragraph would result
      in rsp->completed being greater than or equal to rsp->gpnum, which is
      never valid for a CPU stall, allowing the false positive to be rejected.
      Furthermore, any fetch that gets an old value of rsp->jiffies_stall
      must also get an old value of rsp->gpnum, which will again be rejected
      by the comparison of rsp->gpnum and rsp->completed.  Situations where
      rsp->gp_start is later than rsp->jiffies_stall are also rejected, as
      are situations where jiffies is less than rsp->jiffies_stall.
      
      Although use of unsynchronized accesses means that there are likely
      still some false-positive scenarios (synchronization has proven to be
      a very bad idea on large systems), this should get rid of a large class
      of these scenarios.
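      The ordering can be modelled in user space with C11 release/acquire
      fences (the kernel uses its own barrier primitives, and the structure
      below is only a stand-in for the relevant rsp fields):

          #include <stdatomic.h>
          #include <stdbool.h>

          struct gp_model {
              atomic_ulong completed;      /* last completed grace period */
              atomic_ulong gpnum;          /* current grace period number */
              atomic_ulong gp_start;       /* jiffies at grace-period start */
              atomic_ulong jiffies_stall;  /* jiffies at which to warn */
          };

          /* Writer: grace-period initialization, in the order described above. */
          static void gp_init(struct gp_model *s, unsigned long now, unsigned long timeout)
          {
              atomic_fetch_add(&s->completed, 1);         /* end previous GP */
              atomic_thread_fence(memory_order_release);
              atomic_store(&s->gp_start, now);
              atomic_thread_fence(memory_order_release);
              atomic_store(&s->jiffies_stall, now + timeout);
              atomic_thread_fence(memory_order_release);
              atomic_fetch_add(&s->gpnum, 1);             /* start new GP */
          }

          /* Reader: stall check, loading in the reverse order and rejecting
           * any view that is inconsistent or not yet due. */
          static bool stall_plausible(struct gp_model *s, unsigned long now)
          {
              unsigned long gpnum = atomic_load(&s->gpnum);
              atomic_thread_fence(memory_order_acquire);
              unsigned long stall = atomic_load(&s->jiffies_stall);
              atomic_thread_fence(memory_order_acquire);
              unsigned long start = atomic_load(&s->gp_start);
              atomic_thread_fence(memory_order_acquire);
              unsigned long completed = atomic_load(&s->completed);

              if (completed >= gpnum)            /* no GP in progress: reject */
                  return false;
              if (start > stall || now < stall)  /* stale view, or not yet due */
                  return false;
              return true;
          }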
      Reported-by: Fabian Herschel <fabian.herschel@suse.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Jochen Striepe <jochen@tolot.escape.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4438bb21
    • jbd2: avoid infinite loop when destroying aborted journal · cd9683b6
      Jan Kara authored
      commit 841df7df upstream.
      
      Commit 6f6a6fda "jbd2: fix ocfs2 corrupt when updating journal
      superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
      when the journal is aborted. That makes the logic in
      jbd2_log_do_checkpoint() bail out, which is fine, except that
      jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
      progress in cleaning the journal. Without that, jbd2_journal_destroy()
      just spins in an infinite loop.

      Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
      jbd2_log_do_checkpoint() fails with an error.
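      In outline, the destroy-time loop then behaves like this sketch (locking
      omitted; jbd2_journal_destroy_checkpoint() here stands for whatever
      helper tears down the checkpoint lists, which is an assumption rather
      than a literal excerpt):

          while (journal->j_checkpoint_transactions != NULL) {
              err = jbd2_log_do_checkpoint(journal);
              if (err) {
                  /*
                   * The journal is aborted and will make no further
                   * progress; drop the checkpoint lists by hand instead
                   * of spinning here forever.
                   */
                  jbd2_journal_destroy_checkpoint(journal);
                  break;
              }
          }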
      Reported-by: Eryu Guan <guaneryu@gmail.com>
      Tested-by: Eryu Guan <guaneryu@gmail.com>
      Fixes: 6f6a6fda ("jbd2: fix ocfs2 corrupt when updating journal superblock fails")
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      cd9683b6
    • x86/paravirt: Replace the paravirt nop with a bona fide empty function · ad158ee3
      Andy Lutomirski authored
      commit fc57a7c6 upstream.
      
      PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
      example, trimmed for readability):
      
          ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 <nmi+0x6>
                    2792: R_X86_64_PC32     pv_irq_ops+0x2c
      
      That's a call through a function pointer to a regular C function that
      does nothing on native boots, but that function isn't protected
      against kprobes, isn't marked notrace, and is certainly not
      guaranteed to preserve any registers if the compiler is feeling
      perverse.  This is bad news for a CLBR_NONE operation.
      
      Of course, if everything works correctly, once paravirt ops are
      patched, it gets nopped out, but what if we hit this code before
      paravirt ops are patched in?  This can potentially cause breakage
      that is very difficult to debug.
      
      A more subtle failure is possible here, too: if _paravirt_nop uses
      the stack at all (even just to push RBP), it will overwrite the "NMI
      executing" variable if it's called in the NMI prologue.
      
      The Xen case, perhaps surprisingly, is fine, because it's already
      written in asm.
      
      Fix all of the cases that default to paravirt_nop (including
      adjust_exception_frame) with a big hammer: replace paravirt_nop with
      an asm function that is just a ret instruction.
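      A sketch of the idea (section names and annotations are approximate, not
      the exact upstream change): define the default nop in assembly so it is
      guaranteed to be nothing but a ret, clobbering no registers and touching
      no stack.

          /* C-level declaration so the symbol can be used as a paravirt default. */
          extern void _paravirt_nop(void);

          /* Pure-assembly definition: a single ret, no prologue, no stack use,
           * so it is safe even from the NMI entry path. */
          asm(".pushsection .text\n\t"
              ".global _paravirt_nop\n\t"
              ".type _paravirt_nop, @function\n\t"
              "_paravirt_nop:\n\t"
              "ret\n\t"
              ".size _paravirt_nop, . - _paravirt_nop\n\t"
              ".popsection");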
      
      The Xen case may have other problems, so document them.
      
      This is part of a fix for some random crashes that Sasha saw.
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ad158ee3
    • x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI code · a6f39d84
      Andy Lutomirski authored
      commit 83c133cf upstream.
      
      The NMI entry code that switches to the normal kernel stack needs to
      be very careful not to clobber any extra stack slots on the NMI
      stack.  The code is fine under the assumption that SWAPGS is just a
      normal instruction, but that assumption isn't really true.  Use
      SWAPGS_UNSAFE_STACK instead.
      
      This is part of a fix for some random crashes that Sasha saw.
      
      Fixes: 9b6e6a83 ("x86/nmi/64: Switch stacks on userspace NMI entry")
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Link: http://lkml.kernel.org/r/974bc40edffdb5c2950a5c4977f821a446b76178.1442791737.git.luto@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      a6f39d84
  2. 05 Oct, 2015 9 commits
    • Linux 3.12.49 · c0007620
      Jiri Slaby authored
      c0007620
    • powerpc/MSI: Fix race condition in tearing down MSI interrupts · d5c47ccb
      Paul Mackerras authored
      commit e297c939 upstream.
      
      This fixes a race which can result in the same virtual IRQ number
      being assigned to two different MSI interrupts.  The most visible
      consequence of that is usually a warning and stack trace from the
      sysfs code about an attempt to create a duplicate entry in sysfs.
      
      The race happens when one CPU (say CPU 0) is disposing of an MSI
      while another CPU (say CPU 1) is setting up an MSI.  CPU 0 calls
      (for example) pnv_teardown_msi_irqs(), which calls
      msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
      hardware IRQ number) is no longer in use.  Then, before CPU 0 gets
      to calling irq_dispose_mapping() to free up the virtual IRQ number,
      CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
      MSI, and gets the same hardware IRQ number that CPU 0 just freed.
      CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
      which sees that there is currently a mapping for that hardware IRQ
      number and returns the corresponding virtual IRQ number (which is
      the same virtual IRQ number that CPU 0 was using).  CPU 0 then
      calls irq_dispose_mapping() and frees that virtual IRQ number.
      Now, if another CPU comes along and calls irq_create_mapping(), it
      is likely to get the virtual IRQ number that was just freed,
      resulting in the same virtual IRQ number apparently being used for
      two different hardware interrupts.
      
      To fix this race, we just move the call to msi_bitmap_free_hwirqs()
      to after the call to irq_dispose_mapping().  Since virq_to_hw()
      doesn't work for the virtual IRQ number after irq_dispose_mapping()
      has been called, we need to call it before irq_dispose_mapping() and
      remember the result for the msi_bitmap_free_hwirqs() call.
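      The fixed teardown order for one MSI entry then follows this pattern (a
      sketch; the msi_bmp field name is borrowed from the powernv case and the
      surrounding loop is omitted):

          irq_hw_number_t hwirq;

          /* virq_to_hw() is only valid while the mapping still exists... */
          hwirq = virq_to_hw(entry->irq);
          /* ...so dispose of the virtual IRQ first... */
          irq_dispose_mapping(entry->irq);
          /* ...and only then return the hardware IRQ number to the bitmap,
           * making it available for re-allocation. */
          msi_bitmap_free_hwirqs(&phb->msi_bmp, hwirq, 1);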
      
      The pattern of calling msi_bitmap_free_hwirqs() before
      irq_dispose_mapping() appears in 5 places under arch/powerpc, and
      appears to have originated in commit 05af7bd2 ("[POWERPC] MPIC
      U3/U4 MSI backend") from 2007.
      
      Fixes: 05af7bd2 ("[POWERPC] MPIC U3/U4 MSI backend")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d5c47ccb
    • md: flush ->event_work before stopping array. · e460fa8b
      Neil Brown authored
      commit ee5d004f upstream.
      
      The 'event_work' worker used by dm-raid may still be running
      when the array is stopped.  This can result in an oops.
      
      So flush the workqueue on which it is run after detaching
      and before destroying the device.
      Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 9d09e663 ("dm: raid456 basic support")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      e460fa8b
    • vfs: Test for and handle paths that are unreachable from their mnt_root · 2ca1ae46
      Eric W. Biederman authored
      commit 397d425d upstream.
      
      In rare cases a directory can be renamed out from under a bind mount.
      In those cases without special handling it becomes possible to walk up
      the directory tree to the root dentry of the filesystem and down
      from the root dentry to every other file or directory on the filesystem.
      
      Like division by zero, ".." from an unconnected path can not be given
      a useful semantic as there is no predicting at which path component
      the code will realize it is unconnected.  We certainly can not match
      the current behavior as the current behavior is a security hole.

      Therefore when encountering ".." while following an unconnected path
      return -ENOENT.
      
      - Add a function path_connected to verify path->dentry is reachable
        from path->mnt.mnt_root.  AKA to validate that rename did not do
        something nasty to the bind mount.
      
        To avoid races path_connected must be called after following a path
        component to its next path component.
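      A sketch of what such a check could look like (close in spirit to the
      helper the patch adds, but treat the exact fields and signature as
      assumptions):

          /*
           * Return true if path->dentry is still reachable from the mount's
           * root. For a non-bind mount the mount root is the superblock root,
           * so every dentry on the filesystem is trivially connected.
           */
          static bool path_connected(const struct path *path)
          {
              struct vfsmount *mnt = path->mnt;

              if (mnt->mnt_root == mnt->mnt_sb->s_root)
                  return true;

              return is_subdir(path->dentry, mnt->mnt_root);
          }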
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2ca1ae46
    • dcache: Handle escaped paths in prepend_path · b32388c0
      Eric W. Biederman authored
      commit cde93be4 upstream.
      
      A rename can result in a dentry that by walking up d_parent
      will never reach its mnt_root.  For lack of a better term
      I call this an escaped path.
      
      prepend_path is called by four different functions __d_path,
      d_absolute_path, d_path, and getcwd.
      
      __d_path only wants to see paths that are connected to the root it
      passes in.  So __d_path needs prepend_path to return an error.
      
      d_absolute_path similarly wants to see paths that are connected to
      some root.  Escaped paths are not connected to any mnt_root so
      d_absolute_path needs prepend_path to return an error greater
      than 1.  So escaped paths will be treated like paths on lazily
      unmounted mounts.
      
      getcwd needs to prepend "(unreachable)" so getcwd also needs
      prepend_path to return an error.
      
      d_path is the interesting hold out.  d_path just wants to print
      something, and does not care about the weird cases.  Which raises
      the question what should be printed?
      
      Given that <escaped_path>/<anything> should result in -ENOENT I
      believe it is desirable for escaped paths to be printed as empty
      paths, as there are not really any meaningful path components when
      considered from the perspective of a mount tree.
      
      So tweak prepend_path to return an empty path with a new error
      code of 3 when it encounters an escaped path.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      b32388c0
    • x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 864c198b
      Andy Lutomirski authored
      commit 810bc075 upstream.
      
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: other than paravirt, there's little need for all this
        complexity. We could check RIP instead of RSP. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      
      864c198b
    • x86/nmi/64: Reorder nested NMI checks · 10c99c97
      Andy Lutomirski authored
      commit a27507ca upstream.
      
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      10c99c97
    • x86/nmi/64: Improve nested NMI comments · d2ad2032
      Andy Lutomirski authored
      commit 0b22930e upstream.
      
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d2ad2032
    • fib_rules: fix fib rule dumps across multiple skbs · 737ea802
      Wilson Kok authored
      [ Upstream commit 41fc0143 ]
      
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in the fib rule dump only ever dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      
      [rd] fix for <= 3.19: fib_nl_fill_rule() returns skb->len, make the
           check 'err < 0'
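      The resulting control flow in the AF_UNSPEC branch of the dump handler
      is sketched below (shape only, not the actual netlink handler; position
      bookkeeping via cb->args is omitted):

          list_for_each_entry_rcu(ops, &net->rules_ops, list) {
              err = dump_rules(skb, cb, ops);  /* now 0 or a negative error */
              if (err < 0)
                  break;  /* skb is full: stop and resume in the next skb */
          }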
      Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      737ea802
  3. 30 Sep, 2015 24 commits
    • openvswitch: Zero flows on allocation. · 3b27550c
      Jesse Gross authored
      [ Upstream commit ae5f2fb1 ]
      
      When support for megaflows was introduced, OVS needed to start
      installing flows with a mask applied to them. Since masking is an
      expensive operation, OVS also had an optimization that would only
      take the parts of the flow keys that were covered by a non-zero
      mask. The values stored in the remaining pieces should not matter
      because they are masked out.
      
      While this works fine for the purposes of matching (which must always
      look at the mask), serialization to netlink can be problematic. Since
      the flow and the mask are serialized separately, the uninitialized
      portions of the flow can be encoded with whatever values happen to be
      present.
      
      In terms of functionality, this has little effect since these fields
      will be masked out by definition. However, it leaks kernel memory to
      userspace, which is a potential security vulnerability. It is also
      possible that other code paths could look at the masked key and get
      uninitialized data, although this does not currently appear to be an
      issue in practice.
      
      This removes the mask optimization for flows that are being installed.
      This was always intended to be the case as the mask optimizations were
      really targeting per-packet flow operations.
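      The hygiene point can be pictured with a trivial masked copy: when the
      copy covers the whole key length rather than only the ranges with a
      non-zero mask, every byte the mask zeroes out is also zeroed in the
      stored flow, so nothing uninitialized can later be serialized (a generic
      illustration, not the OVS data structures):

          #include <stddef.h>
          #include <stdint.h>

          /* Copy src into dst with the mask applied over the entire key. */
          static void masked_key_copy(uint8_t *dst, const uint8_t *src,
                                      const uint8_t *mask, size_t len)
          {
              size_t i;

              for (i = 0; i < len; i++)
                  dst[i] = src[i] & mask[i];
          }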
      
      Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      3b27550c
    • sctp: fix race on protocol/netns initialization · 540a0bd9
      Marcelo Ricardo Leitner authored
      [ Upstream commit 8e2d61e0 ]
      
      Consider that the sctp module is unloaded and is being requested
      because a user is creating an SCTP socket.
      
      During initialization, sctp will add the new protocol type and then
      initialize pernet subsys:
      
              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;
      
              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;
      
              status = register_pernet_subsys(&sctp_net_ops);
      
      The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
      is possible for userspace to create SCTP sockets as if the module were
      already fully loaded. If that happens, one of the possible effects is
      that we will have readers for the net->sctp.local_addr_list list earlier
      than expected, and sctp_net_init() does not take precautions while
      dealing with that list, leading to a potential panic but not limited to
      that, as sctp_sock_init() will copy a bunch of blank/partially
      initialized values from net->sctp.
      
      The race happens like this:
      
           CPU 0                           |  CPU 1
        socket()                           |
         __sock_create                     | socket()
          inet_create                      |  __sock_create
           list_for_each_entry_rcu(        |
              answer, &inetsw[sock->type], |
              list) {                      |   inet_create
            /* no hits */                  |
           if (unlikely(err)) {            |
            ...                            |
            request_module()               |
            /* socket creation is blocked  |
             * the module is fully loaded  |
             */                            |
             sctp_init                     |
              sctp_v4_protosw_init         |
               inet_register_protosw       |
                list_add_rcu(&p->list,     |
                             last_perm);   |
                                           |  list_for_each_entry_rcu(
                                           |     answer, &inetsw[sock->type],
              sctp_v6_protosw_init         |     list) {
                                           |     /* hit, so assumes protocol
                                           |      * is already loaded
                                           |      */
                                           |  /* socket creation continues
                                           |   * before netns is initialized
                                           |   */
              register_pernet_subsys       |
      
      Simply inverting the initialization order between
      register_pernet_subsys() and sctp_v4_protosw_init() is not possible
      because register_pernet_subsys() will create a control sctp socket, so
      the protocol must be already visible by then. Deferring the socket
      creation to a workqueue is not good, especially because we lose the
      ability to handle its errors.
      
      So, as suggested by Vlad, the fix is to split netns initialization into
      two moments: defaults and control socket, so that the defaults are
      already loaded by the time we register the protocol, while the control
      socket initialization is kept at the same moment it is today.
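      In outline, the init sequence becomes something like the following
      sketch (the names of the two pernet_operations are illustrative
      assumptions):

              /* Defaults (e.g. the per-netns address lists) must be ready
               * before the protocol becomes visible to socket() callers. */
              status = register_pernet_subsys(&sctp_defaults_ops);
              if (status)
                      goto err_register_defaults;

              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;

              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;

              /* The control socket still comes last, once everything it
               * needs is in place. */
              status = register_pernet_subsys(&sctp_ctrlsock_ops);
              if (status)
                      goto err_register_ctrlsock;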
      
      Fixes: 4db67e80 ("sctp: Make the address lists per network namespace")
      Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      540a0bd9
    • netlink, mmap: transform mmap skb into full skb on taps · 92994a5f
      Daniel Borkmann authored
      [ Upstream commit 1853c949 ]
      
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(); in my case I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      92994a5f
    • net/ipv6: Correct PIM6 mrt_lock handling · 2b80121a
      Richard Laing authored
      [ Upstream commit 25b4a44c ]
      
      In the IPv6 multicast routing code the mrt_lock was not being released
      correctly in the MFC iterator, as a result adding or deleting a MIF would
      cause a hang because the mrt_lock could not be acquired.
      
      This fix is a copy of the code for the IPv4 case and ensures that the lock
      is released correctly.
      Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2b80121a
    • ipv6: fix exthdrs offload registration in out_rt path · d3c40560
      Daniel Borkmann authored
      [ Upstream commit e41b0bed ]
      
      We previously registered the IPPROTO_ROUTING offload via
      inet6_add_offload(), but in the error path we try to unregister it with
      inet_del_offload(). This doesn't seem correct; it should actually be
      inet6_del_offload(). Also, ipv6_exthdrs_offload_exit() from that commit
      seems rather incorrect (it also uses rthdr_offload twice), but it got
      removed entirely later on.
      
      Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d3c40560
    • usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared · 3c09b8d5
      Eugene Shatokhin authored
      [ Upstream commit f50791ac ]
      
      We need to check the EVENT_NO_RUNTIME_PM bit of dev->flags in
      usbnet_stop(), but its value must be read before it is cleared
      when dev->flags is set to 0.
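      In other words, the stop path has to latch the bit first, along these
      lines (a sketch of the ordering only, not the full usbnet_stop()):

          /* Latch the runtime-PM property of this interface... */
          int mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);

          /* ...before the event flags are wiped for the stopped state. */
          dev->flags = 0;

          /* ... */
          if (mpn)
              usb_autopm_put_interface(dev->intf);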
      
      The problem was spotted and the fix was provided by
      Oliver Neukum <oneukum@suse.de>.
      Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
      Acked-by: Oliver Neukum <oneukum@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      3c09b8d5
    • ip6_gre: release cached dst on tunnel removal · 48b478e1
      huaibin Wang authored
      [ Upstream commit d4257295 ]
      
      When a tunnel is deleted, the cached dst entry should be released.
      
      This problem may prevent the removal of a netns (seen with a x-netns IPv6
      gre tunnel):
        unregister_netdevice: waiting for lo to become free. Usage count = 3
      
      CC: Dmitry Kozlov <xeb@mail.ru>
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      48b478e1
    • hfs,hfsplus: cache pages correctly between bnode_create and bnode_free · d14ac6a4
      Hin-Tak Leung authored
      commit 7cb74be6 upstream.
      
      Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
      hfs_bnode_find() for finding or creating pages corresponding to an inode)
      are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
      and should not be page_cache_release()'ed until hfs_bnode_free().
      
      This patch fixes a problem I first saw in July 2012: merely running "du"
      on a large hfsplus-mounted directory a few times on a reasonably loaded
      system would get the hfsplus driver all confused and complaining about
      B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
      recently, I can generate this problem on up-to-date Fedora 22 with shipped
      kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
      mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
      lightly-used QEMU VM image of the full Mac OS X 10.9:
      
      $ df -i / /home /mnt
      Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
      /dev/mapper/fedora-root    3276800  551665    2725135   17% /
      /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
      /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt
      
      After applying the patch, I was able to run "du /" (60+ times) and "du
      /mnt" (150+ times) continuously and simultaneously for 6+ hours.
      
      There are many reports of the hfsplus driver getting confused under load
      and generating "BUG: Bad page state" or other similar issues over the
      years.  [1]
      
      The unpatched code [2] has always been wrong since it entered the kernel
      tree.  The only reason why it gets away with it is that the
      kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
      the kernel has not had a chance to reuse the memory for something else,
      most of the time.
      
      The current RW driver appears to have followed the design and development
      of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
      2001) had a B-tree node-centric approach to
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
      migrating towards version 0.2 (June 2002) of caching and releasing pages
      per inode extents.  When the current RW code first entered the kernel [2]
      in 2005, there was an REF_PAGES conditional (and "//" commented out code)
      to switch between B-node centric paging to inode-centric paging.  There
      was a mistake with the direction of one of the REF_PAGES conditionals in
      __hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
      removed, but a page_cache_release() was mistakenly left in (propagating
      the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
      page_cache_release() in bnode_release() (which should be spanned by
      !REF_PAGES) was never enabled.
      
      References:
      [1]:
      Michael Fox, Apr 2013
      http://www.spinics.net/lists/linux-fsdevel/msg63807.html
      ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
      
      Sasha Levin, Feb 2015
      http://lkml.org/lkml/2015/2/20/85 ("use after free")
      
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
      https://bugzilla.kernel.org/show_bug.cgi?id=42342
      https://bugzilla.kernel.org/show_bug.cgi?id=63841
      https://bugzilla.kernel.org/show_bug.cgi?id=78761
      
      [2]:
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfs/bnode.c?id=d1081202
      commit d1081202
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:36 2004 -0800
      
          [PATCH] HFS rewrite
      
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfsplus/bnode.c?id=91556682
      
      commit 91556682
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:48 2004 -0800
      
          [PATCH] HFS+ support
      
      [3]:
      http://sourceforge.net/projects/linux-hfsplus/
      
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
      
      http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
      fs/hfsplus/bnode.c?r1=1.4&r2=1.5
      
      Date:   Thu Jun 6 09:45:14 2002 +0000
      Use buffer cache instead of page cache in bnode.c. Cache inode extents.
      
      [4]:
      http://git.kernel.org/cgit/linux/kernel/git/\
      stable/linux-stable.git/commit/?id=a5e3985f
      
      commit a5e3985f
      Author: Roman Zippel <zippel@linux-m68k.org>
      Date:   Tue Sep 6 15:18:47 2005 -0700
      
      [PATCH] hfs: remove debug code
      Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Sougata Santra <sougata@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d14ac6a4
    • stmmac: troubleshoot unexpected bits in des0 & des1 · 8769343a
      Alexey Brodkin authored
      commit f1590670 upstream.
      
      The current implementation of the descriptor init procedure only takes
      care of setting/clearing the ownership flag in the "des0"/"des1"
      fields, while it is perfectly possible to get unexpected bits set
      because of the following factors:
      
       [1] On driver probe underlying memory allocated with
           dma_alloc_coherent() might not be zeroed and so
           it will be filled with garbage.
      
       [2] During driver operation some bits could be set by SD/MMC
           controller (for example error flags etc).
      
      And unexpected and/or randomly set flags in "des0"/"des1"
      fields may lead to unpredictable behavior of GMAC DMA block.
      
      This change addresses both items above with:
      
       [1] Use of dma_zalloc_coherent() instead of simple
           dma_alloc_coherent() to make sure allocated memory is
           zeroed. That shouldn't affect performance because
           this allocation only happens once on driver probe.
      
       [2] Do explicit zeroing of both "des0" and "des1" fields
           of all buffer descriptors during initialization of
           DMA transfer.
      
      And while at it, fixed the indentation of the dma_free_coherent()
      counterpart as well.
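      Both points can be illustrated briefly (a sketch using the generic DMA
      API and the driver's des0/des1 field naming, not a literal excerpt):

          /* [1] Get the descriptor memory back already zeroed at probe time. */
          ring = dma_zalloc_coherent(dev, size, &ring_phys, GFP_KERNEL);
          if (!ring)
              return -ENOMEM;

          /* [2] When (re)initializing a descriptor for a new DMA transfer,
           * clear the whole control/status words instead of only touching
           * the ownership bit. */
          p->des0 = 0;
          p->des1 = 0;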
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: arc-linux-dev@synopsys.com
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      8769343a
    • IB/mlx4: Use correct SL on AH query under RoCE · 98b7c921
      Noa Osherovich authored
      commit 5e99b139 upstream.
      
      The mlx4 IB driver implementation for ib_query_ah used a wrong offset
      (28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
      
      Fixes: fa417f7b ('IB/mlx4: Add support for IBoE')
      Signed-off-by: Shani Michaeli <shanim@mellanox.com>
      Signed-off-by: Noa Osherovich <noaos@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      98b7c921
    • IB/mlx4: Forbid using sysfs to change RoCE pkeys · 36a2ea3b
      Jack Morgenstein authored
      commit 2b135db3 upstream.
      
      The pkey mapping for RoCE must remain the default mapping:
      VFs:
        virtual index 0 = mapped to real index 0 (0xFFFF)
        All others indices: mapped to a real pkey index containing an
                            invalid pkey.
      PF:
        virtual index i = real index i.
      
      Don't allow users to change these mappings using files found in
      sysfs.
      
      Fixes: c1e7e466 ('IB/mlx4: Add iov directory in sysfs under the ib device')
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      36a2ea3b
    • IB/uverbs: Fix race between ib_uverbs_open and remove_one · d59a6097
      Yishai Hadas authored
      commit 35d4a0b6 upstream.
      
      Fixes: 2a72f212 ("IB/uverbs: Remove dev_table")
      
      Before this commit there was a device look-up table that was protected
      by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When
      it was dropped and container_of was used instead, it enabled the race
      with remove_one as dev might be freed just after:
      dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but
      before the kref_get.
      
      In addition, this buggy patch added some dead code as
      container_of(x,y,z) can never be NULL and so dev can never be NULL.
      As a result the comment above ib_uverbs_open saying "the open method
      will either immediately run -ENXIO" is wrong as it can never happen.
      
      The solution follows Jason Gunthorpe suggestion from below URL:
      https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html
      
      cdev will hold a kref on the parent (the containing structure,
      ib_uverbs_device) and only when that kref is released it is
      guaranteed that open will never be called again.
      
      In addition, fixes the active count scheme to use an atomic
      not a kref to prevent WARN_ON as pointed by above comment
      from Jason.
      Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d59a6097
    • IB/uverbs: reject invalid or unknown opcodes · 6c1375b0
      Christoph Hellwig authored
      commit b632ffa7 upstream.
      
      We have many WR opcodes that are only supported in kernel space
      and/or require optional information to be copied into the WR
      structure.  Reject all those not explicitly handled so that we
      can't pass invalid information to drivers.
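      Concretely, the send-WR parsing can whitelist the opcodes it knows how
      to marshal and fail everything else, along these lines (a sketch, not
      the exact upstream switch statement):

          switch (user_wr->opcode) {
          case IB_WR_RDMA_WRITE:
          case IB_WR_RDMA_WRITE_WITH_IMM:
          case IB_WR_SEND:
          case IB_WR_SEND_WITH_IMM:
          case IB_WR_RDMA_READ:
          case IB_WR_ATOMIC_CMP_AND_SWP:
          case IB_WR_ATOMIC_FETCH_AND_ADD:
              /* opcodes user space may issue; copy their extra fields */
              break;
          default:
              /* kernel-only or unknown opcode: never hand it to a driver */
              ret = -EINVAL;
              goto out_put;
          }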
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6c1375b0
    • IB/qib: Change lkey table allocation to support more MRs · 9b0c3ad1
      Mike Marciniszyn authored
      commit d6f1c17e upstream.
      
      The lkey table is allocated with a get_user_pages() with an
      order based on a number of index bits from a module parameter.
      
      The underlying kernel code cannot allocate that many contiguous pages.
      
      There is no reason the underlying memory needs to be physically
      contiguous.
      
      This patch:
      - switches the allocation/deallocation to vmalloc/vfree
      - caps the number of bits to 23 to ensure at least 1 generation bit
        o this matches the module parameter description
      Reviewed-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      9b0c3ad1
    • hfs: fix B-tree corruption after insertion at position 0 · 5eb8a3fc
      Hin-Tak Leung authored
      commit b4cc0efe upstream.
      
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().
      
      This is an identical change to the corresponding hfs b-tree code to Sergei
      Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
      to keep similar code paths in the hfs and hfsplus drivers in sync, where
      appropriate.
      Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Sergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      5eb8a3fc
    • xen/gntdev: convert priv->lock to a mutex · 2081451c
      David Vrabel authored
      commit 1401c00e upstream.
      
      Unmapping may require sleeping and we unmap while holding priv->lock, so
      convert it to a mutex.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2081451c
    • md/raid10: always set reshape_safe when initializing reshape_position. · c7af5eb4
      NeilBrown authored
      commit 299b0685 upstream.
      
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c7af5eb4
    • mmc: core: fix race condition in mmc_wait_data_done · cefcb16c
      Jialing Fu authored
      commit 71f8a4b8 upstream.
      
      The following panic was captured on kernel 3.14, but the issue still
      exists in the latest kernel.
      ---------------------------------------------------------------------
      [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer dereference
      at virtual address 00000578
      ......
      [   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
      [   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
      [   20.740134] c0 3136 (Compiler) Call trace:
      [   20.740165] c0 3136 (Compiler) [<ffffffc0008ee900>] _raw_spin_lock_irqsave+0x24/0x60
      [   20.740200] c0 3136 (Compiler) [<ffffffc0000dd024>] __wake_up+0x1c/0x54
      [   20.740230] c0 3136 (Compiler) [<ffffffc000639414>] mmc_wait_data_done+0x28/0x34
      [   20.740262] c0 3136 (Compiler) [<ffffffc0006391a0>] mmc_request_done+0xa4/0x220
      [   20.740314] c0 3136 (Compiler) [<ffffffc000656894>] sdhci_tasklet_finish+0xac/0x264
      [   20.740352] c0 3136 (Compiler) [<ffffffc0000a2b58>] tasklet_action+0xa0/0x158
      [   20.740382] c0 3136 (Compiler) [<ffffffc0000a2078>] __do_softirq+0x10c/0x2e4
      [   20.740411] c0 3136 (Compiler) [<ffffffc0000a24bc>] irq_exit+0x8c/0xc0
      [   20.740439] c0 3136 (Compiler) [<ffffffc00008489c>] handle_IRQ+0x48/0xac
      [   20.740469] c0 3136 (Compiler) [<ffffffc000081428>] gic_handle_irq+0x38/0x7c
      ----------------------------------------------------------------------
      Because in SMP, "mrq" has a race condition between the two paths below:
      path1: CPU0: <tasklet context>
        static void mmc_wait_data_done(struct mmc_request *mrq)
        {
           mrq->host->context_info.is_done_rcv = true;
           //
           // If CPU0 has just finished "is_done_rcv = true" in path1, and at
           // this moment, IRQ or ICache line missing happens in CPU0.
           // What happens in CPU1 (path2)?
           //
           // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
           // path2 would have chance to break from wait_event_interruptible
           // in mmc_wait_for_data_req_done and continue to run for next
           // mmc_request (mmc_blk_rw_rq_prep).
           //
           // Within mmc_blk_rq_prep, mrq is cleared to 0.
           // If below line still gets host from "mrq" as the result of
           // compiler, the panic happens as we traced.
           wake_up_interruptible(&mrq->host->context_info.wait);
        }
      
      path2: CPU1: <The mmcqd thread runs mmc_queue_thread>
        static int mmc_wait_for_data_req_done(...
        {
           ...
           while (1) {
                 wait_event_interruptible(context_info->wait,
                         (context_info->is_done_rcv ||
                          context_info->is_new_req));
           	   static void mmc_blk_rw_rq_prep(...
                 {
                 ...
                 memset(brq, 0, sizeof(struct mmc_blk_request));
      
      This issue happens only very occasionally; however, adding mdelay(1) in
      mmc_wait_data_done() as below reproduces it easily.
      
         static void mmc_wait_data_done(struct mmc_request *mrq)
         {
           mrq->host->context_info.is_done_rcv = true;
      +    mdelay(1);
           wake_up_interruptible(&mrq->host->context_info.wait);
          }
      
      At runtime, IRQ or ICache line missing may just happen at the same place
      of the mdelay(1).
      
      This patch gets the mmc_context_info pointer at the beginning of the
      function, which avoids this race condition.
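      A sketch matching that description (essentially the shape of the fix):

          static void mmc_wait_data_done(struct mmc_request *mrq)
          {
              struct mmc_context_info *context_info = &mrq->host->context_info;

              context_info->is_done_rcv = true;
              /* From here on only the saved pointer is used, so it no
               * longer matters if another CPU recycles or clears *mrq. */
              wake_up_interruptible(&context_info->wait);
          }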
      Signed-off-by: default avatarJialing Fu <jlfu@marvell.com>
      Tested-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Fixes: 2220eedf ("mmc: fix async request mechanism ....")
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      cefcb16c
    • Jann Horn's avatar
      fs: if a coredump already exists, unlink and recreate with O_EXCL · 7e73d2c5
      Jann Horn authored
      commit fbb18169 upstream.
      
      It was possible for an attacking user to trick root (or another user) into
      writing his coredumps into an attacker-readable, pre-existing file using
      rename() or link(), causing the disclosure of secret data from the victim
      process' virtual memory.  Depending on the configuration, it was also
      possible to trick root into overwriting system files with coredumps.  Fix
      that issue by never writing coredumps into existing files.
      
      Requirements for the attack:
       - The attack only applies if the victim's process has a nonzero
         RLIMIT_CORE and is dumpable.
       - The attacker can trick the victim into coredumping into an
         attacker-writable directory D, either because the core_pattern is
         relative and the victim's cwd is attacker-writable or because an
         absolute core_pattern pointing to a world-writable directory is used.
       - The attacker has one of these:
        A: on a system with protected_hardlinks=0:
           execute access to a folder containing a victim-owned,
           attacker-readable file on the same partition as D, and the
           victim-owned file will be deleted before the main part of the attack
           takes place. (In practice, there are lots of files that fulfill
           this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
           This does not apply to most Linux systems because most distros set
           protected_hardlinks=1.
        B: on a system with protected_hardlinks=1:
           execute access to a folder containing a victim-owned,
           attacker-readable and attacker-writable file on the same partition
           as D, and the victim-owned file will be deleted before the main part
           of the attack takes place.
           (This seems to be uncommon.)
        C: on any system, independent of protected_hardlinks:
           write access to a non-sticky folder containing a victim-owned,
           attacker-readable file on the same partition as D
           (This seems to be uncommon.)
      
      The basic idea is that the attacker moves the victim-owned file to where
      he expects the victim process to dump its core.  The victim process dumps
      its core into the existing file, and the attacker reads the coredump from
      it.
      
      If the attacker can't move the file because he does not have write access
      to the containing directory, he can instead link the file to a directory
      he controls, then wait for the original link to the file to be deleted
      (because the kernel checks that the link count of the corefile is 1).
      
      A less reliable variant that requires D to be non-sticky works with link()
      and does not require deletion of the original link: link() the file into
      D, but then unlink() it directly before the kernel performs the link count
      check.
      
      On systems with protected_hardlinks=0, this variant allows an attacker to
      not only gain information from coredumps, but also clobber existing,
      victim-writable files with coredumps.  (This could theoretically lead to a
      privilege escalation.)
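       The idea behind the fix, illustrated below in plain userspace C with a
       hypothetical helper (the kernel applies the same pattern inside
       do_coredump()): never dump into a pre-existing inode; create the core
       file with O_EXCL, and if a file is already there, unlink it and
       recreate it, again with O_EXCL.

         #include <errno.h>
         #include <fcntl.h>
         #include <unistd.h>

         /* Hypothetical illustration of "unlink and recreate with O_EXCL". */
         static int open_core_file(const char *path)
         {
            int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

            if (fd < 0 && errno == EEXIST) {
               /* Something already sits at this name (possibly planted by
                * an attacker): remove it and recreate with O_EXCL, so the
                * dump can never be written through an existing inode. */
               if (unlink(path) == 0)
                  fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
            }
            return fd;
         }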
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7e73d2c5
    • Jaewon Kim's avatar
      vmscan: fix increasing nr_isolated incurred by putback unevictable pages · 9ee9b7b6
      Jaewon Kim authored
      commit c54839a7 upstream.
      
       reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
       the number of pages removed from the candidate list.  But
       shrink_page_list() puts back mlocked pages without passing them to the
       caller and without counting them as nr_reclaimed.  This increases
       nr_isolated.
      
       To fix this, this patch changes shrink_page_list() to pass unevictable
       pages back to the caller.  The caller will take care of those pages.
      
      Minchan said:
      
      It fixes two issues.
      
       1. With unevictable pages present, cma_alloc can now succeed.
       
       Strictly speaking, cma_alloc in the current kernel will fail because of
       unevictable pages.
      
       2. It fixes the leak of the NR_ISOLATED vmstat counter.
       
       With this fix, too_many_isolated() works correctly.  Otherwise, it could
       cause a hang until the process gets SIGKILL.
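       Roughly, the shape of the change in shrink_page_list()'s mlocked-page
       path (a sketch, not necessarily the exact hunk): keep the page on the
       list handed back to the caller instead of putting it straight back on
       the LRU, so the caller can also fix up the NR_ISOLATED accounting.

            cull_mlocked:
                  if (PageSwapCache(page))
                        try_to_free_swap(page);
                  unlock_page(page);
       -          putback_lru_page(page);
       +          list_add(&page->lru, &ret_pages);
                  continue;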
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9ee9b7b6
    • Helge Deller's avatar
      parisc: Filter out spurious interrupts in PA-RISC irq handler · 768cb8d9
      Helge Deller authored
      commit b1b4e435 upstream.
      
       When detecting a serial port on newer PA-RISC machines (with iosapic) we
       have a long way to go: find the right IRQ line, register it, then
       register the serial port and its irq handler.  During this phase,
       spurious interrupts for the serial port may happen, which then crash the
       kernel because the action handler might not have been set up yet.
      
       So, basically it's a race condition between the serial port hardware and
       the CPU which sets up the necessary fields in the irq structs.  The main
       reason for this race is that we unmask the serial port irqs too early,
       without having set everything up properly first (which isn't easily
       possible, because we need the IRQ number to register the serial ports).
      
      This patch is a work-around for this problem. It adds checks to the CPU irq
      handler to verify if the IRQ action field has been initialized already. If not,
      we just skip this interrupt (which isn't critical for a serial port at bootup).
      The real fix would probably involve rewriting all PA-RISC specific IRQ code
      (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
      the irq chips and proper irq enabling along this line.
      
      This bug has been in the PA-RISC port since the beginning, but the crashes
      happened very rarely with currently used hardware.  But on the latest machine
      which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
      1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
      crashed at every boot because of this race. So, without this patch the machine
       would currently be unusable.
      
      For the record, here is the flow logic:
      1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
      2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
      3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
      4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
      5. serial_init_chip() then registers the 8250 port.
      Problems:
       - In step 4 the CPU irq should not have been unmasked yet; that should
         only happen after step 5
       - If a serial irq arrives after step 4 but before step 5 has finished,
         the kernel will crash
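       A sketch of the check described above, inside the CPU irq handler
       (the exact placement and skip/mask handling follow the real code):

         desc = irq_to_desc(irq);
         if (unlikely(!desc || !desc->action)) {
            /* No action registered yet for this irq: the interrupt is
             * spurious (e.g. the serial port fired before step 5 above),
             * so just skip it instead of crashing in the handler. */
            goto out;
         }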
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      768cb8d9
    • Trond Myklebust's avatar
      NFS: nfs_set_pgio_error sometimes misses errors · a6209e19
      Trond Myklebust authored
      commit e9ae58ae upstream.
      
      We should ensure that we always set the pgio_header's error field
      if a READ or WRITE RPC call returns an error. The current code depends
      on 'hdr->good_bytes' always being initialised to a large value, which
      is not always done correctly by callers.
      When this happens, applications may end up missing important errors.
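       To see how an error can be missed, consider the kind of guard involved
       (illustrative only; the real nfs_set_pgio_error() differs in detail):
       if a caller leaves hdr->good_bytes at 0 instead of initialising it to
       the full request size, the condition below is never true and the error
       is silently dropped.

         if (pos < hdr->io_start + hdr->good_bytes) {  /* false if good_bytes == 0 */
            hdr->good_bytes = pos - hdr->io_start;
            hdr->error = error;                        /* never reached: error lost */
         }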
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a6209e19
    • NeilBrown's avatar
      NFSv4: don't set SETATTR for O_RDONLY|O_EXCL · d71ea882
      NeilBrown authored
      commit efcbc04e upstream.
      
       It is unusual to combine the open flags O_RDONLY and O_EXCL, but
       it appears that LibreOffice does just that.
      
      [pid  3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
      [pid  3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>
      
      NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
      probably to reset the timestamps.
      
       When the open is O_RDONLY, the SETATTR request does not
       identify any actual attributes to change.
      If no delegation was provided to the open, the SETATTR uses the
      all-zeros stateid and the request is accepted (at least by the
      Linux NFS server - no harm, no foul).
      
      If a read-delegation was provided, this is used in the SETATTR
      request, and a Netapp filer will justifiably claim
      NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
      to retry - indefinitely.
      
      So only treat O_EXCL specially if O_CREAT was also given.
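       The shape of the fix, sketched below (the real test sits in the NFSv4
       open path and the surrounding code differs): only take the
       exclusive-create branch, which triggers the SETATTR, when O_CREAT
       accompanies O_EXCL.

         if ((open_flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL)) {
            /* exclusive create: send the SETATTR that resets the
             * attributes used as the create verifier */
         }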
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d71ea882
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 50ee5ae5
      Filipe Manana authored
      commit 1f9b8c8f upstream.
      
       While we are committing a transaction, it's possible the previous one is
       still finishing its commit and therefore we wait for it to finish first.
       However, we were not checking whether that previous transaction ended up
       getting aborted after we waited for it to commit, so we committed the
       current transaction anyway.  This can lead to fs corruption, because the
       new superblock can point to trees containing one or more nodes/leaves
       that were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
       In this scenario, future attempts to read eb X from disk result in an
       error message like "parent transid verify failed on X wanted Y found Z".
      
       So fix this by aborting the current transaction if, after waiting for
       the previous one, we find that it was aborted.
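       The fix, roughly (a sketch of the hunk in btrfs_commit_transaction();
       the cleanup label is the function's own error path): after waiting for
       the previous transaction, read its aborted status and bail out instead
       of committing on top of it.

                  wait_for_commit(root, prev_trans);
       +          ret = prev_trans->aborted;
                  btrfs_put_transaction(prev_trans);
       +          if (ret)
       +                goto cleanup_transaction;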
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      50ee5ae5