1. 07 Aug, 2014 6 commits
    • Jan Kara's avatar
      timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks · fbbb7208
      Jan Kara authored
      commit 504d5874 upstream.
      
      clockevents_increase_min_delta() calls printk() from under
      hrtimer_bases.lock. That causes lock inversion on scheduler locks because
      printk() can call into the scheduler. Lockdep puts it as:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      3.15.0-rc8-06195-g939f04be #2 Not tainted
      -------------------------------------------------------
      trinity-main/74 is trying to acquire lock:
       (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c
      
      but task is already holding lock:
       (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (hrtimer_bases.lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
             [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
             [<81080792>] task_clock_event_start+0x3a/0x3f
             [<810807a4>] task_clock_event_add+0xd/0x14
             [<8108259a>] event_sched_in+0xb6/0x17a
             [<810826a2>] group_sched_in+0x44/0x122
             [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
             [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
             [<81082bf6>] __perf_install_in_context+0x8b/0xa3
             [<8107eb8e>] remote_function+0x12/0x2a
             [<8105f5af>] smp_call_function_single+0x2d/0x53
             [<8107e17d>] task_function_call+0x30/0x36
             [<8107fb82>] perf_install_in_context+0x87/0xbb
             [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
             [<810856f9>] SyS_perf_event_open+0x17/0x19
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #4 (&ctx->lock){......}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      -> #3 (&rq->lock){-.-.-.}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81040873>] __task_rq_lock+0x33/0x3a
             [<8104184c>] wake_up_new_task+0x25/0xc2
             [<8102474b>] do_fork+0x15c/0x2a0
             [<810248a9>] kernel_thread+0x1a/0x1f
             [<814232a2>] rest_init+0x1a/0x10e
             [<817af949>] start_kernel+0x303/0x308
             [<817af2ab>] i386_start_kernel+0x79/0x7d
      
      -> #2 (&p->pi_lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<810413dd>] try_to_wake_up+0x1d/0xd6
             [<810414cd>] default_wake_function+0xb/0xd
             [<810461f3>] __wake_up_common+0x39/0x59
             [<81046346>] __wake_up+0x29/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #1 (&tty->write_wait){-.....}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<81046332>] __wake_up+0x15/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #0 (&port_lock_key){-.....}:
             [<8104a62d>] __lock_acquire+0x9ea/0xc6d
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<811c60be>] serial8250_console_write+0x8c/0x10c
             [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
             [<8104f5d5>] console_unlock+0x1d7/0x398
             [<8104fb70>] vprintk_emit+0x3da/0x3e4
             [<81425f76>] printk+0x17/0x19
             [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
             [<8105c548>] clockevents_program_event+0xe7/0xf3
             [<8105cc1c>] tick_program_event+0x1e/0x23
             [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
             [<8103c49e>] __remove_hrtimer+0x5b/0x79
             [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
             [<8103cb4b>] hrtimer_cancel+0xd/0x18
             [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
             [<81080705>] task_clock_event_stop+0x20/0x64
             [<81080756>] task_clock_event_del+0xd/0xf
             [<81081350>] event_sched_out+0xab/0x11e
             [<810813e0>] group_sched_out+0x1d/0x66
             [<81081682>] ctx_sched_out+0xaf/0xbf
             [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &port_lock_key --> &ctx->lock --> hrtimer_bases.lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(hrtimer_bases.lock);
                                     lock(&ctx->lock);
                                     lock(hrtimer_bases.lock);
        lock(&port_lock_key);
      
       *** DEADLOCK ***
      
      4 locks held by trinity-main/74:
       #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
       #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
       #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4
      
      stack backtrace:
      CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04be #2
       00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
       8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
       8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
      Call Trace:
       [<81426f69>] dump_stack+0x16/0x18
       [<81425a99>] print_circular_bug+0x18f/0x19c
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104af87>] ? lock_release+0x191/0x223
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8104416d>] ? __dequeue_entity+0x23/0x27
       [<81044505>] ? pick_next_task_fair+0xb1/0x120
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
       [<810475b0>] ? trace_hardirqs_off+0xb/0xd
       [<81056346>] ? rcu_irq_exit+0x64/0x77
      
      Fix the problem by using printk_deferred() which does not call into the
      scheduler.
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbbb7208
    • John Stultz's avatar
      printk: rename printk_sched to printk_deferred · f73ff697
      John Stultz authored
      commit aac74dc4 upstream.
      
      After learning we'll need some sort of deferred printk functionality in
      the timekeeping core, Peter suggested we rename the printk_sched function
      so it can be reused by needed subsystems.
      
      This only changes the function name. No logic changes.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f73ff697
    • David Rientjes's avatar
      mm, thp: do not allow thp faults to avoid cpuset restrictions · 7f6c1deb
      David Rientjes authored
      commit b104a35d upstream.
      
      The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
      should be set in allocflags.  ALLOC_CPUSET controls if a page allocation
      should be restricted only to the set of allowed cpuset mems.
      
      Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
      the fault path from using memory compaction or direct reclaim.  Thus, it
      is unfairly able to allocate outside of its cpuset mems restriction as a
      side-effect.
      
      This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
      truly GFP_ATOMIC by verifying it is also not a thp allocation.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Tested-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6c1deb
    • James Bottomley's avatar
      scsi: handle flush errors properly · 0e04ec4d
      James Bottomley authored
      commit 89fb4cd1 upstream.
      
      Flush commands don't transfer data and thus need to be special cased
      in the I/O completion handler so that we can propagate errors to
      the block layer and filesystem.
      Signed-off-by: default avatarJames Bottomley <JBottomley@Parallels.com>
      Reported-by: default avatarSteven Haber <steven@qumulo.com>
      Tested-by: default avatarSteven Haber <steven@qumulo.com>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e04ec4d
    • Konstantin Khlebnikov's avatar
      ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout · 2e926a6b
      Konstantin Khlebnikov authored
      commit 811a2407 upstream.
      
      On LPAE, each level 1 (pgd) page table entry maps 1GiB, and the level 2
      (pmd) entries map 2MiB.
      
      When the identity mapping is created on LPAE, the pgd pointers are copied
      from the swapper_pg_dir.  If we find that we need to modify the contents
      of a pmd, we allocate a new empty pmd table and insert it into the
      appropriate 1GB slot, before then filling it with the identity mapping.
      
      However, if the 1GB slot covers the kernel lowmem mappings, we obliterate
      those mappings.
      
      When replacing a PMD, first copy the old PMD contents to the new PMD, so
      that we preserve the existing mappings, particularly the mappings of the
      kernel itself.
      
      [rewrote commit message and added code comment -- rmk]
      
      Fixes: ae2de101 ("ARM: LPAE: Add identity mapping support for the 3-level page table format")
      Signed-off-by: default avatarKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e926a6b
    • Milan Broz's avatar
      crypto: af_alg - properly label AF_ALG socket · 44ce8f87
      Milan Broz authored
      commit 4c63f83c upstream.
      
      Th AF_ALG socket was missing a security label (e.g. SELinux)
      which means that socket was in "unlabeled" state.
      
      This was recently demonstrated in the cryptsetup package
      (cryptsetup v1.6.5 and later.)
      See https://bugzilla.redhat.com/show_bug.cgi?id=1115120
      
      This patch clones the sock's label from the parent sock
      and resolves the issue (similar to AF_BLUETOOTH protocol family).
      Signed-off-by: default avatarMilan Broz <gmazyland@gmail.com>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44ce8f87
  2. 31 Jul, 2014 11 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.4.101 · 91f7c8cb
      Greg Kroah-Hartman authored
      91f7c8cb
    • Catalin Marinas's avatar
      mm: kmemleak: avoid false negatives on vmalloc'ed objects · 7e2e6118
      Catalin Marinas authored
      commit 7f88f88f upstream.
      
      Commit 248ac0e1 ("mm/vmalloc: remove guard page from between vmap
      blocks") had the side effect of making vmap_area.va_end member point to
      the next vmap_area.va_start.  This was creating an artificial reference
      to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.
      
      This patch marks the vmap_area containing pointers explicitly and
      reduces the min ref_count to 2 as vm_struct still contains a reference
      to the vmalloc'ed object.  The kmemleak add_scan_area() function has
      been improved to allow a SIZE_MAX argument covering the rest of the
      object (for simpler calling sites).
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [hq: Backported to 3.4: Adjust context]
      Signed-off-by: default avatarQiang Huang <h.huangqiang@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      7e2e6118
    • Xi Wang's avatar
      introduce SIZE_MAX · b06b5c62
      Xi Wang authored
      commit a3860c1c upstream.
      
      ULONG_MAX is often used to check for integer overflow when calculating
      allocation size.  While ULONG_MAX happens to work on most systems, there
      is no guarantee that `size_t' must be the same size as `long'.
      
      This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
      portability and readability for allocation size validation.
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Acked-by: default avatarAlex Elder <elder@dreamhost.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Qiang Huang <h.huangqiang@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b06b5c62
    • Martin Schwidefsky's avatar
      s390/ptrace: fix PSW mask check · 883ea134
      Martin Schwidefsky authored
      commit dab6cf55 upstream.
      
      The PSW mask check of the PTRACE_POKEUSR_AREA command is incorrect.
      The PSW_MASK_USER define contains the PSW_MASK_ASC bits, the ptrace
      interface accepts all combinations for the address-space-control
      bits. To protect the kernel space the PSW mask check in ptrace needs
      to reject the address-space-control bit combination for home space.
      
      Fixes CVE-2014-3534
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      883ea134
    • Linus Torvalds's avatar
      Fix gcc-4.9.0 miscompilation of load_balance() in scheduler · e5c662d1
      Linus Torvalds authored
      commit 2062afb4 upstream.
      
      Michel Dänzer and a couple of other people reported inexplicable random
      oopses in the scheduler, and the cause turns out to be gcc mis-compiling
      the load_balance() function when debugging is enabled.  The gcc bug
      apparently goes back to gcc-4.5, but slight optimization changes means
      that it now showed up as a problem in 4.9.0 and 4.9.1.
      
      The instruction scheduling problem causes gcc to schedule a spill
      operation to before the stack frame has been created, which in turn can
      corrupt the spilled value if an interrupt comes in.  There may be other
      effects of this bug too, but that's the code generation problem seen in
      Michel's case.
      
      This is fixed in current gcc HEAD, but the workaround as suggested by
      Markus Trippelsdorf is pretty simple: use -fno-var-tracking-assignments
      when compiling the kernel, which disables the gcc code that causes the
      problem.  This can result in slightly worse debug information for
      variable accesses, but that is infinitely preferable to actual code
      generation problems.
      
      Doing this unconditionally (not just for CONFIG_DEBUG_INFO) also allows
      non-debug builds to verify that the debug build would be identical: we
      can do
      
          export GCC_COMPARE_DEBUG=1
      
      to make gcc internally verify that the result of the build is
      independent of the "-g" flag (it will make the compiler build everything
      twice, toggling the debug flag, and compare the results).
      
      Without the "-fno-var-tracking-assignments" option, the build would fail
      (even with 4.8.3 that didn't show the actual stack frame bug) with a gcc
      compare failure.
      
      See also gcc bugzilla:
      
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61801Reported-by: default avatarMichel Dänzer <michel@daenzer.net>
      Suggested-by: default avatarMarkus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5c662d1
    • Naoya Horiguchi's avatar
      mm: hugetlb: fix copy_hugetlb_page_range() · 97f5dabb
      Naoya Horiguchi authored
      commit 0253d634 upstream.
      
      Commit 4a705fef ("hugetlb: fix copy_hugetlb_page_range() to handle
      migration/hwpoisoned entry") changed the order of
      huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
      in some workloads like hugepage-backed heap allocation via libhugetlbfs.
      This patch fixes it.
      
      The test program for the problem is shown below:
      
        $ cat heap.c
        #include <unistd.h>
        #include <stdlib.h>
        #include <string.h>
      
        #define HPS 0x200000
      
        int main() {
        	int i;
        	char *p = malloc(HPS);
        	memset(p, '1', HPS);
        	for (i = 0; i < 5; i++) {
        		if (!fork()) {
        			memset(p, '2', HPS);
        			p = malloc(HPS);
        			memset(p, '3', HPS);
        			free(p);
        			return 0;
        		}
        	}
        	sleep(1);
        	free(p);
        	return 0;
        }
      
        $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
      
      Fixes 4a705fef ("hugetlb: fix copy_hugetlb_page_range() to handle
      migration/hwpoisoned entry"), so is applicable to -stable kernels which
      include it.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarGuillaume Morin <guillaume@morinfr.org>
      Suggested-by: default avatarGuillaume Morin <guillaume@morinfr.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97f5dabb
    • Sven Wegener's avatar
      x86_32, entry: Store badsys error code in %eax · 4aedd4b0
      Sven Wegener authored
      commit 8142b215 upstream.
      
      Commit 554086d8 ("x86_32, entry: Do syscall exit work on badsys
      (CVE-2014-4508)") introduced a regression in the x86_32 syscall entry
      code, resulting in syscall() not returning proper errors for undefined
      syscalls on CPUs supporting the sysenter feature.
      
      The following code:
      
      > int result = syscall(666);
      > printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));
      
      results in:
      
      > result=666 errno=0 error=Success
      
      Obviously, the syscall return value is the called syscall number, but it
      should have been an ENOSYS error. When run under ptrace it behaves
      correctly, which makes it hard to debug in the wild:
      
      > result=-1 errno=38 error=Function not implemented
      
      The %eax register is the return value register. For debugging via ptrace
      the syscall entry code stores the complete register context on the
      stack. The badsys handlers only store the ENOSYS error code in the
      ptrace register set and do not set %eax like a regular syscall handler
      would. The old resume_userspace call chain contains code that clobbers
      %eax and it restores %eax from the ptrace registers afterwards. The same
      goes for the ptrace-enabled call chain. When ptrace is not used, the
      syscall return value is the passed-in syscall number from the untouched
      %eax register.
      
      Use %eax as the return value register in syscall_badsys and
      sysenter_badsys, like a real syscall handler does, and have the caller
      push the value onto the stack for ptrace access.
      Signed-off-by: default avatarSven Wegener <sven.wegener@stealer.net>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.netReviewed-and-tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4aedd4b0
    • Romain Degez's avatar
      ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode) · 46005527
      Romain Degez authored
      commit b32bfc06 upstream.
      
      Add support of the Promise FastTrak TX8660 SATA HBA in ahci mode by
      registering the board in the ahci_pci_tbl[].
      
      Note: this HBA also provide a hardware RAID mode when activated in
      BIOS but specific drivers from the manufacturer are required in this
      case.
      Signed-off-by: default avatarRomain Degez <romain.degez@gmail.com>
      Tested-by: default avatarRomain Degez <romain.degez@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      46005527
    • Tejun Heo's avatar
      libata: introduce ata_host->n_tags to avoid oops on SAS controllers · 80cd492c
      Tejun Heo authored
      commit 1a112d10 upstream.
      
      1871ee13 ("libata: support the ata host which implements a queue
      depth less than 32") directly used ata_port->scsi_host->can_queue from
      ata_qc_new() to determine the number of tags supported by the host;
      unfortunately, SAS controllers doing SATA don't initialize ->scsi_host
      leading to the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
       IP: [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       PGD 0
       Oops: 0002 [#1] SMP
       Modules linked in: isci libsas scsi_transport_sas mgag200 drm_kms_helper ttm
       CPU: 1 PID: 518 Comm: udevd Not tainted 3.16.0-rc6+ #62
       Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       task: ffff880c1a00b280 ti: ffff88061a000000 task.ti: ffff88061a000000
       RIP: 0010:[<ffffffff814e0618>]  [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       RSP: 0018:ffff88061a003ae8  EFLAGS: 00010012
       RAX: 0000000000000001 RBX: ffff88000241ca80 RCX: 00000000000000fa
       RDX: 0000000000000020 RSI: 0000000000000020 RDI: ffff8806194aa298
       RBP: ffff88061a003ae8 R08: ffff8806194a8000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffff88000241ca80 R12: ffff88061ad58200
       R13: ffff8806194aa298 R14: ffffffff814e67a0 R15: ffff8806194a8000
       FS:  00007f3ad7fe3840(0000) GS:ffff880627620000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000058 CR3: 000000061a118000 CR4: 00000000001407e0
       Stack:
        ffff88061a003b20 ffffffff814e96e1 ffff88000241ca80 ffff88061ad58200
        ffff8800b6bf6000 ffff880c1c988000 ffff880619903850 ffff88061a003b68
        ffffffffa0056ce1 ffff88061a003b48 0000000013d6e6f8 ffff88000241ca80
       Call Trace:
        [<ffffffff814e96e1>] ata_sas_queuecmd+0xa1/0x430
        [<ffffffffa0056ce1>] sas_queuecommand+0x191/0x220 [libsas]
        [<ffffffff8149afee>] scsi_dispatch_cmd+0x10e/0x300 [<ffffffff814a3bc5>] scsi_request_fn+0x2f5/0x550
        [<ffffffff81317613>] __blk_run_queue+0x33/0x40
        [<ffffffff8131781a>] queue_unplugged+0x2a/0x90
        [<ffffffff8131ceb4>] blk_flush_plug_list+0x1b4/0x210
        [<ffffffff8131d274>] blk_finish_plug+0x14/0x50
        [<ffffffff8117eaa8>] __do_page_cache_readahead+0x198/0x1f0
        [<ffffffff8117ee21>] force_page_cache_readahead+0x31/0x50
        [<ffffffff8117ee7e>] page_cache_sync_readahead+0x3e/0x50
        [<ffffffff81172ac6>] generic_file_read_iter+0x496/0x5a0
        [<ffffffff81219897>] blkdev_read_iter+0x37/0x40
        [<ffffffff811e307e>] new_sync_read+0x7e/0xb0
        [<ffffffff811e3734>] vfs_read+0x94/0x170
        [<ffffffff811e43c6>] SyS_read+0x46/0xb0
        [<ffffffff811e33d1>] ? SyS_lseek+0x91/0xb0
        [<ffffffff8171ee29>] system_call_fastpath+0x16/0x1b
       Code: 00 00 00 88 50 29 83 7f 08 01 19 d2 83 e2 f0 83 ea 50 88 50 34 c6 81 1d 02 00 00 40 c6 81 17 02 00 00 00 5d c3 66 0f 1f 44 00 00 <89> 14 25 58 00 00 00
      
      Fix it by introducing ata_host->n_tags which is initialized to
      ATA_MAX_QUEUE - 1 in ata_host_init() for SAS controllers and set to
      scsi_host_template->can_queue in ata_host_register() for !SAS ones.
      As SAS hosts are never registered, this will give them the same
      ATA_MAX_QUEUE - 1 as before.  Note that we can't use
      scsi_host->can_queue directly for SAS hosts anyway as they can go
      higher than the libata maximum.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMike Qiu <qiudayu@linux.vnet.ibm.com>
      Reported-by: default avatarJesse Brandeburg <jesse.brandeburg@gmail.com>
      Reported-by: default avatarPeter Hurley <peter@hurleysoftware.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Tested-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Fixes: 1871ee13 ("libata: support the ata host which implements a queue depth less than 32")
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      80cd492c
    • Kevin Hao's avatar
      libata: support the ata host which implements a queue depth less than 32 · 6f0844c4
      Kevin Hao authored
      commit 1871ee13 upstream.
      
      The sata on fsl mpc8315e is broken after the commit 8a4aeec8
      ("libata/ahci: accommodate tag ordered controllers"). The reason is
      that the ata controller on this SoC only implement a queue depth of
      16. When issuing the commands in tag order, all the commands in tag
      16 ~ 31 are mapped to tag 0 unconditionally and then causes the sata
      malfunction. It makes no senses to use a 32 queue in software while
      the hardware has less queue depth. So consider the queue depth
      implemented by the hardware when requesting a command tag.
      
      Fixes: 8a4aeec8 ("libata/ahci: accommodate tag ordered controllers")
      Signed-off-by: default avatarKevin Hao <haokexin@gmail.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f0844c4
    • Christoph Hellwig's avatar
      block: don't assume last put of shared tags is for the host · 696acd9d
      Christoph Hellwig authored
      commit d45b3279 upstream.
      
      There is no inherent reason why the last put of a tag structure must be
      the one for the Scsi_Host, as device model objects can be held for
      arbitrary periods.  Merge blk_free_tags and __blk_free_tags into a single
      funtion that just release a references and get rid of the BUG() when the
      host reference wasn't the last.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      696acd9d
  3. 28 Jul, 2014 23 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.4.100 · 82f9c4a3
      Greg Kroah-Hartman authored
      82f9c4a3
    • Takao Indoh's avatar
      iommu/vt-d: Disable translation if already enabled · 21870a3c
      Takao Indoh authored
      commit 3a93c841 upstream.
      
      This patch disables translation(dma-remapping) before its initialization
      if it is already enabled.
      
      This is needed for kexec/kdump boot. If dma-remapping is enabled in the
      first kernel, it need to be disabled before initializing its page table
      during second kernel boot. Wei Hu also reported that this is needed
      when second kernel boots with intel_iommu=off.
      
      Basically iommu->gcmd is used to know whether translation is enabled or
      disabled, but it is always zero at boot time even when translation is
      enabled since iommu->gcmd is initialized without considering such a
      case. Therefor this patch synchronizes iommu->gcmd value with global
      command register when iommu structure is allocated.
      Signed-off-by: default avatarTakao Indoh <indou.takao@jp.fujitsu.com>
      Signed-off-by: default avatarJoerg Roedel <joro@8bytes.org>
      [wyj: Backported to 3.4: adjust context]
      Signed-off-by: default avatarYijing Wang <wangyijing@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21870a3c
    • Takashi Iwai's avatar
      PM / sleep: Fix request_firmware() error at resume · 5dc1c885
      Takashi Iwai authored
      commit 4320f6b1 upstream.
      
      The commit [247bc037: PM / Sleep: Mitigate race between the freezer
      and request_firmware()] introduced the finer state control, but it
      also leads to a new bug; for example, a bug report regarding the
      firmware loading of intel BT device at suspend/resume:
        https://bugzilla.novell.com/show_bug.cgi?id=873790
      
      The root cause seems to be a small window between the process resume
      and the clear of usermodehelper lock.  The request_firmware() function
      checks the UMH lock and gives up when it's in UMH_DISABLE state.  This
      is for avoiding the invalid  f/w loading during suspend/resume phase.
      The problem is, however, that usermodehelper_enable() is called at the
      end of thaw_processes().  Thus, a thawed process in between can kick
      off the f/w loader code path (in this case, via btusb_setup_intel())
      even before the call of usermodehelper_enable().  Then
      usermodehelper_read_trylock() returns an error and request_firmware()
      spews WARN_ON() in the end.
      
      This oneliner patch fixes the issue just by setting to UMH_FREEZING
      state again before restarting tasks, so that the call of
      request_firmware() will be blocked until the end of this function
      instead of returning an error.
      
      Fixes: 247bc037 (PM / Sleep: Mitigate race between the freezer and request_firmware())
      Link: https://bugzilla.novell.com/show_bug.cgi?id=873790Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5dc1c885
    • John Stultz's avatar
      alarmtimer: Fix bug where relative alarm timers were treated as absolute · 299e667e
      John Stultz authored
      commit 16927776 upstream.
      
      Sharvil noticed with the posix timer_settime interface, using the
      CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
      tried to specify a relative time timer, it would incorrectly be
      treated as absolute regardless of the state of the flags argument.
      
      This patch corrects this, properly checking the absolute/relative flag,
      as well as adds further error checking that no invalid flag bits are set.
      Reported-by: default avatarSharvil Nanavati <sharvil@google.com>
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      299e667e
    • Alex Deucher's avatar
      drm/radeon: avoid leaking edid data · b63dd4c8
      Alex Deucher authored
      commit 0ac66eff upstream.
      
      In some cases we fetch the edid in the detect() callback
      in order to determine what sort of monitor is connected.
      If that happens, don't fetch the edid again in the get_modes()
      callback or we will leak the edid.
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b63dd4c8
    • Amitkumar Karwar's avatar
      mwifiex: fix Tx timeout issue · ed379762
      Amitkumar Karwar authored
      commit d76744a9 upstream.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=70191
      https://bugzilla.kernel.org/show_bug.cgi?id=77581
      
      It is observed that sometimes Tx packet is downloaded without
      adding driver's txpd header. This results in firmware parsing
      garbage data as packet length. Sometimes firmware is unable
      to read the packet if length comes out as invalid. This stops
      further traffic and timeout occurs.
      
      The root cause is uninitialized fields in tx_info(skb->cb) of
      packet used to get garbage values. In this case if
      MWIFIEX_BUF_FLAG_REQUEUED_PKT flag is mistakenly set, txpd
      header was skipped. This patch makes sure that tx_info is
      correctly initialized to fix the problem.
      Reported-by: default avatarAndrew Wiley <wiley.andrew.j@gmail.com>
      Reported-by: default avatarLinus Gasser <list@markas-al-nour.org>
      Reported-by: default avatarMichael Hirsch <hirsch@teufel.de>
      Tested-by: default avatarXinming Hu <huxm@marvell.com>
      Signed-off-by: default avatarAmitkumar Karwar <akarwar@marvell.com>
      Signed-off-by: default avatarMaithili Hinge <maithili@marvell.com>
      Signed-off-by: default avatarAvinash Patil <patila@marvell.com>
      Signed-off-by: default avatarBing Zhao <bzhao@marvell.com>
      Signed-off-by: default avatarJohn W. Linville <linville@tuxdriver.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed379762
    • HATAYAMA Daisuke's avatar
      perf/x86/intel: ignore CondChgd bit to avoid false NMI handling · a2b2c030
      HATAYAMA Daisuke authored
      commit b292d7a1 upstream.
      
      Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
      if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
      
      For example, we use external NMI to make system panic to get crash
      dump, but in this case, the external NMI is falsely handled do to the
      issue.
      
      This commit deals with the issue simply by ignoring CondChgd bit.
      
      Here is explanation in detail.
      
      On x86 NMI watchdog uses performance monitoring feature to
      periodically signal NMI each time performance counter gets overflowed.
      
      intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
      handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
      owner of a given NMI by looking at overflow status bits in
      MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
      handles the given NMI as its own NMI.
      
      The problem is that the intel_pmu_handle_irq() doesn't distinguish
      CondChgd bit from other bits. Unlike the other status bits, CondChgd
      bit doesn't represent overflow status for performance counters. Thus,
      CondChgd bit cannot be thought of as a mark indicating a given NMI is
      NMI watchdog's.
      
      As a result, if CondChgd bit is set, any NMI is falsely handled by the
      NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
      is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
      action is never performed until CondChgd bit is cleared.
      
      I noticed this behavior on systems with Ivy Bridge processors: Intel
      Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
      CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
      in the beginning at boot. Then the CondChgd bit is immediately cleared
      by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
      0.
      
      On the other hand, on older processors such as Nehalem, Xeon E7540,
      CondChgd bit is not set in the beginning at boot.
      
      I'm not sure about exact behavior of CondChgd bit, in particular when
      this bit is set. Although I read Intel System Programmer's Manual to
      figure out that, the descriptions I found are:
      
        In 18.9.1:
      
        "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to
         indicate changes to the state of performancmonitoring hardware"
      
        In Table 35-2 IA-32 Architectural MSRs
      
        63 CondChg: status bits of this register has changed.
      
      These are different from the bahviour I see on the actual system as I
      explained above.
      
      At least, I think ignoring CondChgd bit should be enough for NMI
      watchdog perspective.
      Signed-off-by: default avatarHATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2b2c030
    • Eric Dumazet's avatar
      ipv4: fix buffer overflow in ip_options_compile() · 26adeae3
      Eric Dumazet authored
      [ Upstream commit 10ec9472 ]
      
      There is a benign buffer overflow in ip_options_compile spotted by
      AddressSanitizer[1] :
      
      Its benign because we always can access one extra byte in skb->head
      (because header is followed by struct skb_shared_info), and in this case
      this byte is not even used.
      
      [28504.910798] ==================================================================
      [28504.912046] AddressSanitizer: heap-buffer-overflow in ip_options_compile
      [28504.913170] Read of size 1 by thread T15843:
      [28504.914026]  [<ffffffff81802f91>] ip_options_compile+0x121/0x9c0
      [28504.915394]  [<ffffffff81804a0d>] ip_options_get_from_user+0xad/0x120
      [28504.916843]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.918175]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.919490]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.920835]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.922208]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.923459]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.924722]
      [28504.925106] Allocated by thread T15843:
      [28504.925815]  [<ffffffff81804995>] ip_options_get_from_user+0x35/0x120
      [28504.926884]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.927975]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.929175]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.930400]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.931677]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.932851]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.934018]
      [28504.934377] The buggy address ffff880026382828 is located 0 bytes to the right
      [28504.934377]  of 40-byte region [ffff880026382800, ffff880026382828)
      [28504.937144]
      [28504.937474] Memory state around the buggy address:
      [28504.938430]  ffff880026382300: ........ rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.939884]  ffff880026382400: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.941294]  ffff880026382500: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.942504]  ffff880026382600: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.943483]  ffff880026382700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.944511] >ffff880026382800: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.945573]                         ^
      [28504.946277]  ffff880026382900: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.094949]  ffff880026382a00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.096114]  ffff880026382b00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.097116]  ffff880026382c00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.098472]  ffff880026382d00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.099804] Legend:
      [28505.100269]  f - 8 freed bytes
      [28505.100884]  r - 8 redzone bytes
      [28505.101649]  . - 8 allocated bytes
      [28505.102406]  x=1..7 - x allocated bytes + (8-x) redzone bytes
      [28505.103637] ==================================================================
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernelSigned-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      26adeae3
    • Ben Hutchings's avatar
      dns_resolver: Null-terminate the right string · 4427f9ab
      Ben Hutchings authored
      [ Upstream commit 640d7efe ]
      
      *_result[len] is parsed as *(_result[len]) which is not at all what we
      want to touch here.
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Fixes: 84a7c0b1 ("dns_resolver: assure that dns_query() result is null-terminated")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4427f9ab
    • Manuel Schölling's avatar
      dns_resolver: assure that dns_query() result is null-terminated · c304a23b
      Manuel Schölling authored
      [ Upstream commit 84a7c0b1 ]
      
      dns_query() credulously assumes that keys are null-terminated and
      returns a copy of a memory block that is off by one.
      Signed-off-by: default avatarManuel Schölling <manuel.schoelling@gmx.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c304a23b
    • Sowmini Varadhan's avatar
      sunvnet: clean up objects created in vnet_new() on vnet_exit() · 7670d472
      Sowmini Varadhan authored
      [ Upstream commit a4b70a07 ]
      
      Nothing cleans up the objects created by
      vnet_new(), they are completely leaked.
      
      vnet_exit(), after doing the vio_unregister_driver() to clean
      up ports, should call a helper function that iterates over vnet_list
      and cleans up those objects. This includes unregister_netdevice()
      as well as free_netdev().
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: default avatarDave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: default avatarKarl Volz <karl.volz@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7670d472
    • Christoph Schulz's avatar
      net: pppoe: use correct channel MTU when using Multilink PPP · 48b83dfd
      Christoph Schulz authored
      [ Upstream commit a8a3e41c ]
      
      The PPP channel MTU is used with Multilink PPP when ppp_mp_explode() (see
      ppp_generic module) tries to determine how big a fragment might be. According
      to RFC 1661, the MTU excludes the 2-byte PPP protocol field, see the
      corresponding comment and code in ppp_mp_explode():
      
      		/*
      		 * hdrlen includes the 2-byte PPP protocol field, but the
      		 * MTU counts only the payload excluding the protocol field.
      		 * (RFC1661 Section 2)
      		 */
      		mtu = pch->chan->mtu - (hdrlen - 2);
      
      However, the pppoe module *does* include the PPP protocol field in the channel
      MTU, which is wrong as it causes the PPP payload to be 1-2 bytes too big under
      certain circumstances (one byte if PPP protocol compression is used, two
      otherwise), causing the generated Ethernet packets to be dropped. So the pppoe
      module has to subtract two bytes from the channel MTU. This error only
      manifests itself when using Multilink PPP, as otherwise the channel MTU is not
      used anywhere.
      
      In the following, I will describe how to reproduce this bug. We configure two
      pppd instances for multilink PPP over two PPPoE links, say eth2 and eth3, with
      a MTU of 1492 bytes for each link and a MRRU of 2976 bytes. (This MRRU is
      computed by adding the two link MTUs and subtracting the MP header twice, which
      is 4 bytes long.) The necessary pppd statements on both sides are "multilink
      mtu 1492 mru 1492 mrru 2976". On the client side, we additionally need "plugin
      rp-pppoe.so eth2" and "plugin rp-pppoe.so eth3", respectively; on the server
      side, we additionally need to start two pppoe-server instances to be able to
      establish two PPPoE sessions, one over eth2 and one over eth3. We set the MTU
      of the PPP network interface to the MRRU (2976) on both sides of the connection
      in order to make use of the higher bandwidth. (If we didn't do that, IP
      fragmentation would kick in, which we want to avoid.)
      
      Now we send a ICMPv4 echo request with a payload of 2948 bytes from client to
      server over the PPP link. This results in the following network packet:
      
         2948 (echo payload)
       +    8 (ICMPv4 header)
       +   20 (IPv4 header)
      ---------------------
         2976 (PPP payload)
      
      These 2976 bytes do not exceed the MTU of the PPP network interface, so the
      IP packet is not fragmented. Now the multilink PPP code in ppp_mp_explode()
      prepends one protocol byte (0x21 for IPv4), making the packet one byte bigger
      than the negotiated MRRU. So this packet would have to be divided in three
      fragments. But this does not happen as each link MTU is assumed to be two bytes
      larger. So this packet is diveded into two fragments only, one of size 1489 and
      one of size 1488. Now we have for that bigger fragment:
      
         1489 (PPP payload)
       +    4 (MP header)
       +    2 (PPP protocol field for the MP payload (0x3d))
       +    6 (PPPoE header)
      --------------------------
         1501 (Ethernet payload)
      
      This packet exceeds the link MTU and is discarded.
      
      If one configures the link MTU on the client side to 1501, one can see the
      discarded Ethernet frames with tcpdump running on the client. A
      
      ping -s 2948 -c 1 192.168.15.254
      
      leads to the smaller fragment that is correctly received on the server side:
      
      (tcpdump -vvvne -i eth3 pppoes and ppp proto 0x3d)
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x3] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [end], length 1492
      
      and to the bigger fragment that is not received on the server side:
      
      (tcpdump -vvvne -i eth2 pppoes and ppp proto 0x3d)
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1515: PPPoE  [ses 0x5] MLPPP (0x003d), length 1495: seq 0x000,
        Flags [begin], length 1493
      
      With the patch below, we correctly obtain three fragments:
      
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [begin], length 1492
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [none], length 1492
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 27: PPPoE  [ses 0x1] MLPPP (0x003d), length 7: seq 0x000,
        Flags [end], length 5
      
      And the ICMPv4 echo request is successfully received at the server side:
      
      IP (tos 0x0, ttl 64, id 21925, offset 0, flags [DF], proto ICMP (1),
        length 2976)
          192.168.222.2 > 192.168.15.254: ICMP echo request, id 30530, seq 0,
            length 2956
      
      The bug was introduced in commit c9aa6895
      ("[PPPOE]: Advertise PPPoE MTU") from the very beginning. This patch applies
      to 3.10 upwards but the fix can be applied (with minor modifications) to
      kernels as old as 2.6.32.
      Signed-off-by: default avatarChristoph Schulz <develop@kristov.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      48b83dfd
    • Daniel Borkmann's avatar
      net: sctp: fix information leaks in ulpevent layer · 9de2be63
      Daniel Borkmann authored
      [ Upstream commit 8f2e5ae4 ]
      
      While working on some other SCTP code, I noticed that some
      structures shared with user space are leaking uninitialized
      stack or heap buffer. In particular, struct sctp_sndrcvinfo
      has a 2 bytes hole between .sinfo_flags and .sinfo_ppid that
      remains unfilled by us in sctp_ulpevent_read_sndrcvinfo() when
      putting this into cmsg. But also struct sctp_remote_error
      contains a 2 bytes hole that we don't fill but place into a skb
      through skb_copy_expand() via sctp_ulpevent_make_remote_error().
      
      Both structures are defined by the IETF in RFC6458:
      
      * Section 5.3.2. SCTP Header Information Structure:
      
        The sctp_sndrcvinfo structure is defined below:
      
        struct sctp_sndrcvinfo {
          uint16_t sinfo_stream;
          uint16_t sinfo_ssn;
          uint16_t sinfo_flags;
          <-- 2 bytes hole  -->
          uint32_t sinfo_ppid;
          uint32_t sinfo_context;
          uint32_t sinfo_timetolive;
          uint32_t sinfo_tsn;
          uint32_t sinfo_cumtsn;
          sctp_assoc_t sinfo_assoc_id;
        };
      
      * 6.1.3. SCTP_REMOTE_ERROR:
      
        A remote peer may send an Operation Error message to its peer.
        This message indicates a variety of error conditions on an
        association. The entire ERROR chunk as it appears on the wire
        is included in an SCTP_REMOTE_ERROR event. Please refer to the
        SCTP specification [RFC4960] and any extensions for a list of
        possible error formats. An SCTP error notification has the
        following format:
      
        struct sctp_remote_error {
          uint16_t sre_type;
          uint16_t sre_flags;
          uint32_t sre_length;
          uint16_t sre_error;
          <-- 2 bytes hole  -->
          sctp_assoc_t sre_assoc_id;
          uint8_t  sre_data[];
        };
      
      Fix this by setting both to 0 before filling them out. We also
      have other structures shared between user and kernel space in
      SCTP that contains holes (e.g. struct sctp_paddrthlds), but we
      copy that buffer over from user space first and thus don't need
      to care about it in that cases.
      
      While at it, we can also remove lengthy comments copied from
      the draft, instead, we update the comment with the correct RFC
      number where one can look it up.
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9de2be63
    • Jon Paul Maloy's avatar
      tipc: clear 'next'-pointer of message fragments before reassembly · d1558f71
      Jon Paul Maloy authored
      [ Upstream commit 99941754 ]
      
      If the 'next' pointer of the last fragment buffer in a message is not
      zeroed before reassembly, we risk ending up with a corrupt message,
      since the reassembly function itself isn't doing this.
      
      Currently, when a buffer is retrieved from the deferred queue of the
      broadcast link, the next pointer is not cleared, with the result as
      described above.
      
      This commit corrects this, and thereby fixes a bug that may occur when
      long broadcast messages are transmitted across dual interfaces. The bug
      has been present since 40ba3cdf ("tipc:
      message reassembly using fragment chain")
      
      This commit should be applied to both net and net-next.
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1558f71
    • Suresh Reddy's avatar
      be2net: set EQ DB clear-intr bit in be_open() · 3c656d48
      Suresh Reddy authored
      [ Upstream commit 4cad9f3b ]
      
      On BE3, if the clear-interrupt bit of the EQ doorbell is not set the first
      time it is armed, ocassionally we have observed that the EQ doesn't raise
      anymore interrupts even if it is in armed state.
      This patch fixes this by setting the clear-interrupt bit when EQs are
      armed for the first time in be_open().
      Signed-off-by: default avatarSuresh Reddy <Suresh.Reddy@emulex.com>
      Signed-off-by: default avatarSathya Perla <sathya.perla@emulex.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c656d48
    • Andrey Utkin's avatar
      appletalk: Fix socket referencing in skb · 3fa1f507
      Andrey Utkin authored
      [ Upstream commit 36beddc2 ]
      
      Setting just skb->sk without taking its reference and setting a
      destructor is invalid. However, in the places where this was done, skb
      is used in a way not requiring skb->sk setting. So dropping the setting
      of skb->sk.
      Thanks to Eric Dumazet <eric.dumazet@gmail.com> for correct solution.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=79441Reported-by: default avatarEd Martin <edman007@edman007.com>
      Signed-off-by: default avatarAndrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fa1f507
    • Yuchung Cheng's avatar
      tcp: fix false undo corner cases · f37e4913
      Yuchung Cheng authored
      [ Upstream commit 6e08d5e3 ]
      
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmit something
      2) the retransmission never fails locally (e.g., qdisc drop)
      
      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.
      
      When the assumption is broken because TCP's cwnd is too small to
      retransmit or the retransmit fails locally. The next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may enter the recovery state. The sender repeatedly enter and
      (incorrectly) exit recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f37e4913
    • dingtianhong's avatar
      igmp: fix the problem when mc leave group · fbaf3a0b
      dingtianhong authored
      [ Upstream commit 52ad353a ]
      
      The problem was triggered by these steps:
      
      1) create socket, bind and then setsockopt for add mc group.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
         setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
      
      2) drop the mc group for this socket.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
         setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
      
      3) and then drop the socket, I found the mc group was still used by the dev:
      
         netstat -g
      
         Interface       RefCnt Group
         --------------- ------ ---------------------
         eth2		   1	  255.0.0.37
      
      Normally even though the IP_DROP_MEMBERSHIP return error, the mc group still need
      to be released for the netdev when drop the socket, but this process was broken when
      route default is NULL, the reason is that:
      
      The ip_mc_leave_group() will choose the in_dev by the imr_interface.s_addr, if input addr
      is NULL, the default route dev will be chosen, then the ifindex is got from the dev,
      then polling the inet->mc_list and return -ENODEV, but if the default route dev is NULL,
      the in_dev and ifIndex is both NULL, when polling the inet->mc_list, the mc group will be
      released from the mc_list, but the dev didn't dec the refcnt for this mc group, so
      when dropping the socket, the mc_list is NULL and the dev still keep this group.
      
      v1->v2: According Hideaki's suggestion, we should align with IPv6 (RFC3493) and BSDs,
      	so I add the checking for the in_dev before polling the mc_list, make sure when
      	we remove the mc group, dec the refcnt to the real dev which was using the mc address.
      	The problem would never happened again.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbaf3a0b
    • Li RongQing's avatar
      8021q: fix a potential memory leak · d7f54732
      Li RongQing authored
      [ Upstream commit 916c1689 ]
      
      skb_cow called in vlan_reorder_header does not free the skb when it failed,
      and vlan_reorder_header returns NULL to reset original skb when it is called
      in vlan_untag, lead to a memory leak.
      Signed-off-by: default avatarLi RongQing <roy.qing.li@gmail.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7f54732
    • Neal Cardwell's avatar
      tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb · 06c84e98
      Neal Cardwell authored
      [ Upstream commit 2cd0d743 ]
      
      If there is an MSS change (or misbehaving receiver) that causes a SACK
      to arrive that covers the end of an skb but is less than one MSS, then
      tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
      the skb ("Round if necessary..."), then chopping all bytes off the skb
      and creating a zero-byte skb in the write queue.
      
      This was visible now because the recently simplified TLP logic in
      bef1909e ("tcp: fixing TLP's FIN recovery") could find that 0-byte
      skb at the end of the write queue, and now that we do not check that
      skb's length we could send it as a TLP probe.
      
      Consider the following example scenario:
      
       mss: 1000
       skb: seq: 0 end_seq: 4000  len: 4000
       SACK: start_seq: 3999 end_seq: 4000
      
      The tcp_match_skb_to_sack() code will compute:
      
       in_sack = false
       pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
       new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
       new_len += mss = 4000
      
      Previously we would find the new_len > skb->len check failing, so we
      would fall through and set pkt_len = new_len = 4000 and chop off
      pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
      afterward in the write queue.
      
      With this new commit, we notice that the new new_len >= skb->len check
      succeeds, so that we return without trying to fragment.
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      06c84e98
    • Hugh Dickins's avatar
      shmem: fix splicing from a hole while it's punched · 21618a8f
      Hugh Dickins authored
      commit b1a36650 upstream.
      
      shmem_fault() is the actual culprit in trinity's hole-punch starvation,
      and the most significant cause of such problems: since a page faulted is
      one that then appears page_mapped(), needing unmap_mapping_range() and
      i_mmap_mutex to be unmapped again.
      
      But it is not the only way in which a page can be brought into a hole in
      the radix_tree while that hole is being punched; and Vlastimil's testing
      implies that if enough other processors are busy filling in the hole,
      then shmem_undo_range() can be kept from completing indefinitely.
      
      shmem_file_splice_read() is the main other user of SGP_CACHE, which can
      instantiate shmem pagecache pages in the read-only case (without holding
      i_mutex, so perhaps concurrently with a hole-punch).  Probably it's
      silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
      ought to be safe, but might bring surprises - not a change to be rushed.
      
      shmem_read_mapping_page_gfp() is an internal interface used by
      drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
      shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
      called internally by the kernel (perhaps for a stacking filesystem,
      which might rely on holes to be reserved): it's unclear whether it could
      be provoked to keep hole-punch busy or not.
      
      We could apply the same umbrella as now used in shmem_fault() to
      shmem_file_splice_read() and the others; but it looks ugly, and use over
      a range raises questions - should it actually be per page? can these get
      starved themselves?
      
      The origin of this part of the problem is my v3.1 commit d0823576
      ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
      into shmem.c.  It seemed like a nice idea at the time, to ensure
      (barring RCU lookup fuzziness) that there's an instant when the entire
      hole is empty; but the indefinitely repeated scans to ensure that make
      it vulnerable.
      
      Revert that "enhancement" to hole-punch from shmem_undo_range(), but
      retain the unproblematic rescanning when it's truncating; add a couple
      of comments there.
      
      Remove the "indices[0] >= end" test: that is now handled satisfactorily
      by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
      to be worth avoiding here.
      
      But if we do not always loop indefinitely, we do need to handle the case
      of swap swizzled back to page before shmem_free_swap() gets it: add a
      retry for that case, as suggested by Konstantin Khlebnikov; and for the
      case of page swizzled back to swap, as suggested by Johannes Weiner.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Suggested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      21618a8f
    • Hugh Dickins's avatar
      shmem: fix faulting into a hole, not taking i_mutex · 0f5a4a00
      Hugh Dickins authored
      commit 8e205f77 upstream.
      
      Commit f00cdc6d ("shmem: fix faulting into a hole while it's
      punched") was buggy: Sasha sent a lockdep report to remind us that
      grabbing i_mutex in the fault path is a no-no (write syscall may already
      hold i_mutex while faulting user buffer).
      
      We tried a completely different approach (see following patch) but that
      proved inadequate: good enough for a rational workload, but not good
      enough against trinity - which forks off so many mappings of the object
      that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
      into serious starvation when concurrent faults force the puncher to fall
      back to single-page unmap_mapping_range() searches of the i_mmap tree.
      
      So return to the original umbrella approach, but keep away from i_mutex
      this time.  We really don't want to bloat every shmem inode with a new
      mutex or completion, just to protect this unlikely case from trinity.
      So extend the original with wait_queue_head on stack at the hole-punch
      end, and wait_queue item on the stack at the fault end.
      
      This involves further use of i_lock to guard against the races: lockdep
      has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
      i_lock around wake_up_bit(), which is comparable to what we do here.
      i_lock is more convenient, but we could switch to shmem's info->lock.
      
      This issue has been tagged with CVE-2014-4171, which will require commit
      f00cdc6d and this and the following patch to be backported: we
      suggest to 3.1+, though in fact the trinity forkbomb effect might go
      back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
      not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
      Anyone running trinity on 3.0 and earlier? I don't think we need care.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      0f5a4a00
    • Hugh Dickins's avatar
      shmem: fix faulting into a hole while it's punched · a62f374a
      Hugh Dickins authored
      commit f00cdc6d upstream.
      
      Trinity finds that mmap access to a hole while it's punched from shmem
      can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
      from completing, until the reader chooses to stop; with the puncher's
      hold on i_mutex locking out all other writers until it can complete.
      
      It appears that the tmpfs fault path is too light in comparison with its
      hole-punching path, lacking an i_data_sem to obstruct it; but we don't
      want to slow down the common case.
      
      Extend shmem_fallocate()'s existing range notification mechanism, so
      shmem_fault() can refrain from faulting pages into the hole while it's
      punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
      faulting when not).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      a62f374a