1. 06 Feb, 2016 12 commits
    • mm: downgrade VM_BUG in isolate_lru_page() to warning · cf2a82ee
      Kirill A. Shutemov authored
      Calling isolate_lru_page() on a non-LRU page is wrong and shouldn't
      happen, but it is not necessarily fatal: the page simply will not be
      isolated if it's not on the LRU.
      
      Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT().
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: do not try to queue pages from !vma_migratable() · 77bf45e7
      Kirill A. Shutemov authored
      Maybe I am missing something, but I don't see a reason why we try to
      queue pages from non-migratable VMAs.
      
      This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <numaif.h>
      
          #define SIZE 0x2000
      
          int foo;
      
          int main()
          {
              int fd;
              char *p;
              unsigned long mask = 2;
      
              fd = open("/dev/sg0", O_RDWR);
              p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
              /* Fault in pages */
              foo = p[0] + p[0x1000];
              mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
              return 0;
          }
      
      The only case in which we can queue pages from such a VMA is
      MPOL_MF_STRICT plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL, for a VMA which
      has pages on the LRU but whose gfp mask is not suitable for migration
      (see the mapping_gfp_mask() check in vma_migratable()).  That looks
      like a bug to me.
      
      Let's filter out non-migratable VMAs at the start of
      queue_pages_test_walk() and go to queue_pages_pte_range() only if the
      MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress · 564e81a5
      Tetsuo Handa authored
      Jan Stancek has reported that the system occasionally hangs after the
      "oom01" testcase from LTP triggers OOM.  Judging from the fact that
      there is a kworker thread doing memory allocation and that the values
      of "Node 0 Normal free:" and "Node 0 Normal:" differ when hanging,
      vmstat is not up-to-date for some reason.
      
      Commit 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory
      reclaim doesn't make any progress") meant to force the kworker thread
      to take a short sleep, but it mistakenly used schedule_timeout(1).  We
      missed that schedule_timeout() in state TASK_RUNNING doesn't do
      anything.
      
      Fix it by using schedule_timeout_uninterruptible(1) which forces the
      kworker thread to take a short sleep in order to make sure that vmstat
      is up-to-date.
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: make vmstat_update deferrable · ccde8bd4
      Michal Hocko authored
      Commit 0eb77e98 ("vmstat: make vmstat_updater deferrable again and
      shut down on idle") made vmstat_shepherd deferrable.  vmstat_update
      itself is still using a standard timer, which might interrupt the idle
      task.  This is possible because "mm, vmstat: make quiet_vmstat lighter"
      removed cancel_delayed_work from quiet_vmstat.
      
      Change vmstat_work to use DEFERRABLE_WORK to prevent pointless wakeups
      from the idle context.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: make quiet_vmstat lighter · f01f17d3
      Michal Hocko authored
      Mike has reported a considerable overhead of refresh_cpu_vm_stats from
      the idle entry during pipe test:
      
          12.89%  [kernel]       [k] refresh_cpu_vm_stats.isra.12
           4.75%  [kernel]       [k] __schedule
           4.70%  [kernel]       [k] mutex_unlock
           3.14%  [kernel]       [k] __switch_to
      
      This is caused by commit 0eb77e98 ("vmstat: make vmstat_updater
      deferrable again and shut down on idle") which has placed quiet_vmstat
      into cpu_idle_loop.  The main reason here seems to be that the idle
      entry has to go over all zones and perform atomic operations for each
      vmstat entry even though there might be no per-cpu diffs.  This is
      pointless overhead for _each_ idle entry.
      
      Make sure that quiet_vmstat is as light as possible.
      
      First of all, it doesn't make any sense to do any local sync if the
      current cpu is already set in cpu_stat_off, because vmstat_update puts
      itself there only if there is nothing to do.
      
      Then we can check need_update which should be a cheap way to check for
      potential per-cpu diffs and only then do refresh_cpu_vm_stats.
      
      The original patch also did cancel_delayed_work which we are not doing
      here.  There are two reasons for that.  Firstly cancel_delayed_work from
      idle context will blow up on RT kernels (reported by Mike):
      
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
        Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
        Call Trace:
          dump_stack+0x49/0x67
          ___might_sleep+0xf5/0x180
          rt_spin_lock+0x20/0x50
          try_to_grab_pending+0x69/0x240
          cancel_delayed_work+0x26/0xe0
          quiet_vmstat+0x75/0xa0
          cpu_idle_loop+0x38/0x3e0
          cpu_startup_entry+0x13/0x20
          start_secondary+0x114/0x140
      
      And secondly, even on !RT kernels it might add some non-trivial
      overhead which is not necessary.  Even if the vmstat worker wakes up
      and preempts idle, it will most likely be a single-shot no-op because
      the stats were already synced, so it would end up in cpu_stat_off
      anyway.  We just need to teach both vmstat_shepherd and vmstat_update
      to stop scheduling the worker if there is nothing to do.
      
      [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT · 1ce22103
      Vlastimil Babka authored
      The description mentions kswapd threads, while the deferred struct page
      initialization is actually done by one-off "pgdatinitX" threads.
      
      Fix the description so that users are not confused when they see
      pgdatinit threads, rather than kswapd, using CPU after boot.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: don't mark memblock_phys_mem_size() as __init · 1f1ffb8a
      David Gibson authored
      At the moment memblock_phys_mem_size() is marked as __init, and so is
      discarded after boot.  This is different from most of the memblock
      functions which are marked __init_memblock, and are only discarded after
      boot if memory hotplug is not configured.
      
      To allow for upcoming code which will need memblock_phys_mem_size() in
      the hotplug path, change it from __init to __init_memblock.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dump_stack: avoid potential deadlocks · d7ce3692
      Eric Dumazet authored
      Some servers experienced fatal deadlocks because of a combination of
      bugs, leading to multiple cpus calling dump_stack().
      
      The checksumming bug was fixed in commit 34ae6a1a ("ipv6: update
      skb->csum when CE mark is propagated").
      
      The second problem is faulty locking in dump_stack():
      
      CPU1 runs in process context and calls dump_stack(), grabs dump_lock.
      
      CPU2 receives a TCP packet under softirq, grabs the socket spinlock, and
      calls dump_stack() from netdev_rx_csum_fault().
      
      dump_stack() spins on atomic_cmpxchg(&dump_lock, -1, 2), since
      dump_lock is owned by CPU1.
      
      While dumping its stack, CPU1 is interrupted by a softirq, and happens
      to process a packet for the TCP socket locked by CPU2.
      
      CPU1 spins forever in spin_lock(): deadlock.
      
      The stack trace on CPU1 looked like:
      
          NMI backtrace for cpu 1
          RIP: _raw_spin_lock+0x25/0x30
          ...
          Call Trace:
            <IRQ>
            tcp_v6_rcv+0x243/0x620
            ip6_input_finish+0x11f/0x330
            ip6_input+0x38/0x40
            ip6_rcv_finish+0x3c/0x90
            ipv6_rcv+0x2a9/0x500
            process_backlog+0x461/0xaa0
            net_rx_action+0x147/0x430
            __do_softirq+0x167/0x2d0
            call_softirq+0x1c/0x30
            do_softirq+0x3f/0x80
            irq_exit+0x6e/0xc0
            smp_call_function_single_interrupt+0x35/0x40
            call_function_single_interrupt+0x6a/0x70
            <EOI>
            printk+0x4d/0x4f
            printk_address+0x31/0x33
            print_trace_address+0x33/0x3c
            print_context_stack+0x7f/0x119
            dump_trace+0x26b/0x28e
            show_trace_log_lvl+0x4f/0x5c
            show_stack_log_lvl+0x104/0x113
            show_stack+0x42/0x44
            dump_stack+0x46/0x58
            netdev_rx_csum_fault+0x38/0x3c
            __skb_checksum_complete_head+0x6e/0x80
            __skb_checksum_complete+0x11/0x20
            tcp_rcv_established+0x2bd5/0x2fd0
            tcp_v6_do_rcv+0x13c/0x620
            sk_backlog_rcv+0x15/0x30
            release_sock+0xd2/0x150
            tcp_recvmsg+0x1c1/0xfc0
            inet_recvmsg+0x7d/0x90
            sock_recvmsg+0xaf/0xe0
            ___sys_recvmsg+0x111/0x3b0
            SyS_recvmsg+0x5c/0xb0
            system_call_fastpath+0x16/0x1b
      
      Fixes: b58d9774 ("dump_stack: serialize the output from dump_stack()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
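      The lock handoff in the dump_stack() deadlock can be modelled in
      userspace.  The following is a minimal sketch using C11 atomics, not
      the kernel code: dump_trylock(), dump_unlock() and enum lock_result are
      hypothetical names, the re-entry case for the owning CPU is an
      assumption, and softirq preemption is not modelled.  It only shows why
      a second CPU that finds dump_lock owned can do nothing but retry.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for dump_lock; -1 means "no owner". */
static atomic_int dump_lock = -1;

enum lock_result { LOCK_ACQUIRED, LOCK_RECURSIVE, LOCK_BUSY };

/* One attempt to take the lock on behalf of `cpu` (a hypothetical CPU id). */
static enum lock_result dump_trylock(int cpu)
{
    int expected = -1;

    /* Equivalent in spirit to atomic_cmpxchg(&dump_lock, -1, cpu). */
    if (atomic_compare_exchange_strong(&dump_lock, &expected, cpu))
        return LOCK_ACQUIRED;   /* lock was free, now owned by cpu */
    if (expected == cpu)
        return LOCK_RECURSIVE;  /* assumed: the owner may re-enter */
    return LOCK_BUSY;           /* another CPU owns it: caller would spin */
}

/* Only the outermost holder releases the lock. */
static void dump_unlock(enum lock_result how_acquired)
{
    assert(how_acquired != LOCK_BUSY);
    if (how_acquired == LOCK_ACQUIRED)
        atomic_store(&dump_lock, -1);
}
```

      With CPU1 holding the lock, dump_trylock(2) keeps returning LOCK_BUSY
      until CPU1 calls dump_unlock().  In the scenario from the commit
      message CPU1 never gets there, because it is itself spinning on the
      socket lock held by CPU2.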
    • mm: validate_mm browse_rb SMP race condition · acf128d0
      Andrea Arcangeli authored
      The mmap_sem for reading in validate_mm called from expand_stack is not
      enough to prevent the augmented rbtree rb_subtree_gap information from
      changing under us, because expand_stack may be running concurrently in
      other threads which also hold the mmap_sem for reading.
      
      The augmented rbtree is updated with vma_gap_update under the
      page_table_lock, so use it in browse_rb() too to avoid false positives.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • m32r: fix build failure due to SMP and MMU · af1ddcb5
      Sudip Mukherjee authored
      One of the randconfig builds failed with the error:
      
        arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_mm':
        arch/m32r/kernel/smp.c:283:20: error: subscripted value is neither array nor pointer nor vector
          mmc = &mm->context[cpu_id];
                            ^
        arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_page':
        arch/m32r/kernel/smp.c:353:20: error: subscripted value is neither array nor pointer nor vector
          mmc = &mm->context[cpu_id];
                            ^
        arch/m32r/kernel/smp.c: In function 'smp_invalidate_interrupt':
        arch/m32r/kernel/smp.c:479:41: error: subscripted value is neither array nor pointer nor vector
          unsigned long *mmc = &flush_mm->context[cpu_id];
      
      It turned out that CONFIG_SMP was defined but CONFIG_MMU was not.
      But arch/m32r/include/asm/mmu.h only defines mm_context_t as an array
      when both CONFIG_SMP and CONFIG_MMU are defined, and
      arch/m32r/kernel/smp.c always uses context as an array.  So SMP cannot
      work without MMU.
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
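      The m32r build failure can be reproduced in miniature.  In this
      simplified sketch mm_context_t is an array only when both options are
      defined, so the subscript compiles; removing either #define should turn
      it into a scalar and produce the same "subscripted value is neither
      array nor pointer nor vector" error.  The real header has more cases;
      struct mm_sketch, context_for() and the NR_CPUS value are assumptions
      made for illustration.

```c
#include <assert.h>

/* Both options defined: the only configuration with the array form. */
#define CONFIG_SMP 1
#define CONFIG_MMU 1
#define NR_CPUS 2   /* assumed value, for illustration only */

#if defined(CONFIG_SMP) && defined(CONFIG_MMU)
typedef unsigned long mm_context_t[NR_CPUS];   /* one context per CPU */
#else
typedef unsigned long mm_context_t;            /* scalar: cannot be subscripted */
#endif

/* Hypothetical stand-in for the structure that embeds the context. */
struct mm_sketch {
    mm_context_t context;
};

/* Mirrors the failing expression: &mm->context[cpu_id]. */
static unsigned long *context_for(struct mm_sketch *mm, int cpu_id)
{
    assert(cpu_id >= 0 && cpu_id < NR_CPUS);
    return &mm->context[cpu_id];
}
```

      Calling context_for(&mm, 1) returns a pointer to the second per-CPU
      slot, which is only meaningful while the array form of the typedef is
      selected.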
    • block: fix pfn_mkwrite() DAX fault handler · 9c5a05bc
      Ross Zwisler authored
      Previously the pfn_mkwrite() fault handler for raw block devices called
      blkdev_dax_fault() -> __dax_fault() to do a full DAX page fault.
      
      Really what the pfn_mkwrite() fault handler needs to do is call
      dax_pfn_mkwrite() to make sure that the radix tree entry for the given
      PTE is marked as dirty so that a follow-up fsync or msync call will
      flush it durably to media.
      
      Fixes: 5a023cdb ("block: enable dax for raw block devices")
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: avoid random wakeups in sigsuspend() · 823dd322
      Sasha Levin authored
      A random wakeup can get us out of sigsuspend() without TIF_SIGPENDING
      being set.
      
      Avoid that by making sure we were signaled, like sys_pause() does.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
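      Seen from userspace, the same defensive idea is the familiar idiom of
      re-checking a flag around sigsuspend() rather than trusting a single
      wakeup.  The sketch below illustrates that idiom and is not the kernel
      patch; wait_for_usr1(), on_usr1() and the choice of SIGUSR1 are
      assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; the waiter trusts this flag, not the wakeup itself. */
static volatile sig_atomic_t got_signal;

static void on_usr1(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Returns 0 once SIGUSR1 has actually been delivered, -1 on setup failure. */
static int wait_for_usr1(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Block SIGUSR1 so it can only be delivered inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* Mask used while waiting: the previous mask with SIGUSR1 unblocked. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);

    /* Stand-in for an external sender: the signal stays pending for now. */
    raise(SIGUSR1);

    /* Re-check the condition after every return from sigsuspend(). */
    while (!got_signal)
        sigsuspend(&wait_mask);

    assert(got_signal == 1);   /* leaving the loop implies the handler ran */
    return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

      The loop is the point of the example: a return from sigsuspend() on
      its own is not treated as proof that the awaited signal arrived.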
  2. 04 Feb, 2016 3 commits
  3. 03 Feb, 2016 25 commits