1. 30 Aug, 2017 40 commits
    • Bluetooth: hidp: fix possible might sleep error in hidp_session_thread · e792d2d4
      Jeffy Chen authored
      commit 5da8e47d upstream.
      
      It looks like hidp_session_thread has the same pattern as the issue reported in
      the old rfcomm code:
      
      	while (1) {
      		set_current_state(TASK_INTERRUPTIBLE);
      		if (condition)
      			break;
      		// may call might_sleep here
      		schedule();
      	}
      	__set_current_state(TASK_RUNNING);
      
      This was fixed in:
      	dfb2fae7 Bluetooth: Fix nested sleeps
      
      So let's fix it in the same way, also following the suggestion in:
      https://lwn.net/Articles/628628/
      
      Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
      Tested-by: AL Yu-Chen Cho <acho@suse.com>
      Tested-by: Rohit Vaswani <rvaswani@nvidia.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv · 1eb33a1b
      Mateusz Jurczyk authored
      commit f55ce7b0 upstream.
      
      Verify that the length of the socket buffer is sufficient to cover the
      nlmsghdr structure before accessing the nlh->nlmsg_len field for further
      input sanitization. If the client only supplies 1-3 bytes of data in
      sk_buff, then nlh->nlmsg_len remains partially uninitialized and
      contains leftover memory from the corresponding kernel allocation.
      Operating on such data may result in indeterminate evaluation of the
      nlmsg_len < NLMSG_HDRLEN expression.
      
      The bug was discovered by a runtime instrumentation designed to detect
      use of uninitialized memory in the kernel. The patch prevents this and
      other similar tools (e.g. KMSAN) from flagging this behavior in the future.
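
The ordering of the checks can be illustrated in userspace C. This is a minimal sketch, not the kernel patch: `struct msg_hdr_sketch` and `header_is_sane()` are hypothetical names that only mimic the layout of `struct nlmsghdr` and the idea of comparing the buffer length against the header size before the length field is read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct nlmsghdr (16 bytes). */
struct msg_hdr_sketch {
    uint32_t len;   /* total message length, header included */
    uint16_t type;
    uint16_t flags;
    uint32_t seq;
    uint32_t pid;
};

/* Returns true only if buf holds a complete header with a plausible length. */
static bool header_is_sane(const unsigned char *buf, size_t buf_len)
{
    struct msg_hdr_sketch hdr;

    /* Check the buffer size first: with fewer bytes than a header,
     * hdr.len would be built partly from bytes the sender never supplied. */
    if (buf_len < sizeof(hdr))
        return false;

    memcpy(&hdr, buf, sizeof(hdr));

    /* Only now is it safe to reason about the length field itself. */
    return hdr.len >= sizeof(hdr) && hdr.len <= buf_len;
}
```

A 1 to 3 byte buffer is rejected by the first test, so the length field is never evaluated on partially supplied data.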

      Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: nat: fix src map lookup · 8b504107
      Florian Westphal authored
      commit 97772bcd upstream.
      
      When doing initial conversion to rhashtable I replaced the bucket
      walk with a single rhashtable_lookup_fast().
      
      When moving to rhlist I failed to properly walk the list of identical
      tuples, but that is what is needed for this to work correctly.
      The table contains the original tuples, so the reply tuples are all
      distinct.
      
      We currently decide whether the mapping is in range based only on the
      first entry, but in case it's not, we need to try the reply tuple of the
      next entry until we either find an in-range mapping or have checked
      all the entries.
      
      This bug makes nat core attempt collision resolution while it might be
      able to use the mapping as-is.
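
The difference between checking only the first entry and walking every candidate can be sketched in plain C. This is an illustration under simplifying assumptions: `mapping_sketch`, `range_sketch` and `find_in_range_sketch()` are hypothetical, and a flat array stands in for the rhlist of entries that share the same original tuple.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one candidate mapping and a NAT range. */
struct mapping_sketch { unsigned int addr; };
struct range_sketch   { unsigned int min_addr, max_addr; };

static bool in_range_sketch(const struct mapping_sketch *m,
                            const struct range_sketch *r)
{
    return m->addr >= r->min_addr && m->addr <= r->max_addr;
}

/* Walk all candidates instead of deciding on the first one only. */
static const struct mapping_sketch *
find_in_range_sketch(const struct mapping_sketch *cands, size_t n,
                     const struct range_sketch *r)
{
    for (size_t i = 0; i < n; i++) {
        if (in_range_sketch(&cands[i], r))
            return &cands[i];   /* usable as-is, no collision handling */
    }
    return NULL;                /* every entry was checked, none fits */
}
```

Returning `NULL` only after the whole list has been examined corresponds to falling back to collision resolution as a last resort.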
      
      Fixes: 870190a9 ("netfilter: nat: convert nat bysrc hash to rhashtable")
      Reported-by: Jaco Kroon <jaco@uls.co.za>
      Tested-by: Jaco Kroon <jaco@uls.co.za>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • netfilter: expect: fix crash when putting uninited expectation · f5263887
      Florian Westphal authored
      commit 36ac344e upstream.
      
      We crash in __nf_ct_expect_check because it calls nf_ct_remove_expect on the
      uninitialised expectation instead of the existing one, so del_timer chokes
      on a random memory address.
      
      Fixes: ec0e3f01 ("netfilter: nf_ct_expect: Add nf_ct_remove_expect()")
      Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
      Tested-by: Sergey Kvachonok <ravenexp@gmail.com>
      Cc: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: sunrpc: svcsock: fix NULL-pointer exception · 4909a7b7
      Vadim Lomovtsev authored
      commit eebe53e8 upstream.
      
      While running nfs/connectathon tests, a kernel NULL-pointer exception
      was observed due to races in svcsock.c.
      
      The race appears when the kernel accepts a connection via kernel_accept
      (which creates a new socket) and starts queuing ingress packets
      to the new socket. This happens in ksoftirq context, which could run
      concurrently on a different core while the new socket setup is not done yet.
      
      The fix is to re-order the socket user data init sequence and add
      write/read barrier calls to be sure that we get proper values
      for the callback pointers before actually calling them.
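
The "initialise first, publish last" ordering can be sketched in userspace with C11 atomics. This is an analogy only: the kernel change uses its own barrier primitives, whereas the sketch below substitutes release/acquire operations, and every name in it (`sock_sketch`, `user_data_sketch`, `publish_user_data()`, `data_ready_sketch()`, `count_call()`) is hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef void (*ready_cb)(void *ctx);

/* Hypothetical per-socket private data holding a saved callback. */
struct user_data_sketch {
    ready_cb saved_data_ready;
    void *ctx;
};

/* Hypothetical socket: the softirq-side reader only sees user_data. */
struct sock_sketch {
    _Atomic(struct user_data_sketch *) user_data;
};

/* Writer: fill in every field first, then publish the pointer. The
 * release store keeps the field writes ordered before the publication. */
static void publish_user_data(struct sock_sketch *sk,
                              struct user_data_sketch *ud,
                              ready_cb cb, void *ctx)
{
    ud->saved_data_ready = cb;
    ud->ctx = ctx;
    atomic_store_explicit(&sk->user_data, ud, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so a
 * non-NULL pointer implies fully initialised callback fields. */
static int data_ready_sketch(struct sock_sketch *sk)
{
    struct user_data_sketch *ud =
        atomic_load_explicit(&sk->user_data, memory_order_acquire);

    if (ud == NULL || ud->saved_data_ready == NULL)
        return 0;               /* setup not finished, nothing to call */
    ud->saved_data_ready(ud->ctx);
    return 1;
}

/* Example callback used to observe that the reader invoked it. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

A reader that runs before publication sees `NULL` and returns early instead of jumping through an unset function pointer, which is the failure visible as "PC is at 0x0" in the crash log.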
      
      Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations.
      
      Crash log:
      ---<-snip->---
      [ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000
      [ 6708.647093] pgd = ffff0000094e0000
      [ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000
      [ 6708.660761] Internal error: Oops: 86000005 [#1] SMP
      [ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      [ 6708.736446]  ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021]
      [ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G        W  OE   4.11.0-4.el7.aarch64 #1
      [ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017
      [ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000
      [ 6708.773822] PC is at 0x0
      [ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc]
      [ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145
      [ 6708.789248] sp : ffff810ffbad3900
      [ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58
      [ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00
      [ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28
      [ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000
      [ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00
      [ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2
      [ 6708.824365] x17: 0000000000000000 x16: 0000000000000000
      [ 6708.829667] x15: 0000000000000000 x14: 0000000000000001
      [ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e
      [ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000
      [ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000
      [ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000
      [ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000
      [ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000
      [ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00
      [ 6708.872084]
      [ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000)
      [ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000)
      [ 6708.886075] Call trace:
      [ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840)
      [ 6708.894942] 3700:                                   ffff810012412400 0001000000000000
      [ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000
      [ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000
      [ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000
      [ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d
      [ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140
      [ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000
      [ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028
      [ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000
      [ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000
      [ 6708.973117] [<          (null)>]           (null)
      [ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c
      [ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c
      [ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c
      [ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58
      [ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254
      [ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc
      [ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4
      [ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8
      [ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c
      [ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84
      [ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc
      [ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8
      [ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf]
      [ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf]
      [ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464
      [ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308
      [ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174
      [ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4
      [ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190
      [ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20)
      [ 6709.092929] fde0:                                   0000810ff2ec0000 ffff000008c10000
      [ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18
      [ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80
      [ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4
      [ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50
      [ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000
      [ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000
      [ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20
      [ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145
      [ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238
      [ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140
      [ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c
      [ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30
      [ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4
      [ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30
      [ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160
      [ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4
      [ 6709.207967] Code: bad PC value
      [ 6709.211061] SMP: stopping secondary CPUs
      [ 6709.218830] Starting crashdump kernel...
      [ 6709.222749] Bye!
      ---<-snip>---
      Signed-off-by: Vadim Lomovtsev <vlomovts@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/mm: Fix use-after-free of ldt_struct · a8da876c
      Eric Biggers authored
      commit ccd5b323 upstream.
      
      The following commit:
      
        39a0526f ("x86/mm: Factor out LDT init from context init")
      
      renamed init_new_context() to init_new_context_ldt() and added a new
      init_new_context() which calls init_new_context_ldt().  However, the
      error code of init_new_context_ldt() was ignored.  Consequently, if a
      memory allocation in alloc_ldt_struct() failed during a fork(), the
      ->context.ldt of the new task remained the same as that of the old task
      (due to the memcpy() in dup_mm()).  ldt_struct's are not intended to be
      shared, so a use-after-free occurred after one task exited.
      
      Fix the bug by making init_new_context() pass through the error code of
      init_new_context_ldt().
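
The effect of dropping versus passing through the error code can be shown with a small userspace model. It is a sketch under stated assumptions: `mm_sketch`, `init_ldt_sketch()` and the two `init_new_context_*` variants are hypothetical stand-ins, and the `simulate_oom` field exists only to force the failure path.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the new task's mm after the fork-time memcpy(). */
struct mm_sketch {
    int *ldt;           /* initially still points at the parent's LDT */
    int  simulate_oom;  /* test knob: make the allocation fail */
};

/* Stand-in for the LDT init step: on failure the pointer is left alone. */
static int init_ldt_sketch(struct mm_sketch *mm, int *fresh_ldt)
{
    if (mm->simulate_oom)
        return -ENOMEM;     /* mm->ldt still aliases the parent's LDT */
    mm->ldt = fresh_ldt;
    return 0;
}

/* Buggy shape: the error is ignored, so the caller believes all is well. */
static int init_new_context_buggy(struct mm_sketch *mm, int *fresh_ldt)
{
    (void)init_ldt_sketch(mm, fresh_ldt);
    return 0;
}

/* Fixed shape: the error code is passed through to the caller. */
static int init_new_context_fixed(struct mm_sketch *mm, int *fresh_ldt)
{
    return init_ldt_sketch(mm, fresh_ldt);
}
```

With the buggy variant the caller proceeds with a shared pointer; with the fixed variant it sees `-ENOMEM` and can abort the fork.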
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
          Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710
      
          CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           print_address_description+0x73/0x250 mm/kasan/report.c:252
           kasan_report_error mm/kasan/report.c:351 [inline]
           kasan_report+0x24e/0x340 mm/kasan/report.c:409
           __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
           free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
           free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
           destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
           destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
           __mmdrop+0xe9/0x530 kernel/fork.c:889
           mmdrop include/linux/sched/mm.h:42 [inline]
           exec_mmap fs/exec.c:1061 [inline]
           flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
           load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
           search_binary_handler+0x142/0x6b0 fs/exec.c:1652
           exec_binprm fs/exec.c:1694 [inline]
           do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
           do_execve+0x31/0x40 fs/exec.c:1860
           call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
           ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
      
          Allocated by task 3700:
           save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
           save_stack+0x43/0xd0 mm/kasan/kasan.c:447
           set_track mm/kasan/kasan.c:459 [inline]
           kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
           kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
           kmalloc include/linux/slab.h:493 [inline]
           alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
           write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
           sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
          Freed by task 3700:
           save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
           save_stack+0x43/0xd0 mm/kasan/kasan.c:447
           set_track mm/kasan/kasan.c:459 [inline]
           kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
           __cache_free mm/slab.c:3503 [inline]
           kfree+0xca/0x250 mm/slab.c:3820
           free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
           free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
           destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
           destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
           __mmdrop+0xe9/0x530 kernel/fork.c:889
           mmdrop include/linux/sched/mm.h:42 [inline]
           __mmput kernel/fork.c:916 [inline]
           mmput+0x541/0x6e0 kernel/fork.c:927
           copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
           copy_process kernel/fork.c:1546 [inline]
           _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
           SYSC_clone kernel/fork.c:2135 [inline]
           SyS_clone+0x37/0x50 kernel/fork.c:2129
           do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
           return_from_SYSCALL_64+0x0/0x7a
      
      Here is a C reproducer:
      
          #include <asm/ldt.h>
          #include <pthread.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *fork_thread(void *_arg)
          {
              fork();
              return NULL;
          }
      
          int main(void)
          {
              struct user_desc desc = { .entry_number = 8191 };
      
              syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));
      
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      srand(getpid());
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      Note: the reproducer takes advantage of the fact that alloc_ldt_struct()
      may use vmalloc() to allocate a large ->entries array, and after
      commit:
      
        5d17a73a ("vmalloc: back off when the current task is killed")
      
      it is possible for userspace to fail a task's vmalloc() by
      sending a fatal signal, e.g. via exit_group().  It would be more
      difficult to reproduce this bug on kernels without that commit.
      
      This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.

      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Fixes: 39a0526f ("x86/mm: Factor out LDT init from context init")
      Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • timers: Fix excessive granularity of new timers after a nohz idle · 2e11eede
      Nicholas Piggin authored
      commit 2fe59f50 upstream.
      
      When a timer base is idle, it is forwarded when a new timer is added
      to ensure that granularity does not become excessive. When not idle,
      the timer tick is expected to increment the base.
      
      However there are several problems:
      
      - If an existing timer is modified, the base is forwarded only after
        the index is calculated.
      
      - The base is not forwarded by add_timer_on.
      
      - There is a window after a timer is restarted from a nohz idle, after
        it is marked not-idle and before the timer tick on this CPU, where a
        timer may be added but the ancient base does not get forwarded.
      
      These result in excessive granularity (a 1 jiffy timeout can blow out
      to 100s of jiffies), which causes the rcu lockup detector to trigger,
      among other things.
      
      Fix this by keeping track of whether the timer base has been idle
      since it was last run or forwarded, and if so then forward it before
      adding a new timer.
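
The idea of forwarding a stale base before computing a new timer's distance can be modelled in a few lines of C. This is a deliberately simplified sketch: `base_sketch`, `forward_base_sketch()` and `enqueue_delta_sketch()` are hypothetical, the flag is set by the caller rather than by idle entry, and the real timer wheel considers more state than a single clock value.

```c
#include <stdbool.h>

/* Hypothetical, reduced timer base: a clock plus a "was idle" marker. */
struct base_sketch {
    unsigned long clk;       /* time the wheel believes it is at */
    bool must_forward_clk;   /* set when the base has been idle */
};

/* Bring a base that has been idle up to the current time. */
static void forward_base_sketch(struct base_sketch *b, unsigned long now)
{
    if (!b->must_forward_clk)
        return;                      /* the tick keeps this base current */
    if ((long)(now - b->clk) < 2)
        return;                      /* close enough, nothing to gain */
    b->clk = now;
}

/* Distance used to pick a wheel level: forward first, then measure. */
static unsigned long enqueue_delta_sketch(struct base_sketch *b,
                                          unsigned long now,
                                          unsigned long expires)
{
    forward_base_sketch(b, now);
    return expires - b->clk;
}
```

With a base clock of 0 at time 1000, a timer due at 1001 is one jiffy away after forwarding but appears 1001 jiffies away without it, which is the kind of error that places it in a coarse wheel level.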
      
      There is still a case where mod_timer optimises the case of a pending
      timer mod with the same expiry time, where the timer can see excessive
      granularity relative to the new, shorter interval. A comment is added,
      but it's not changed because it is an important fastpath for
      networking.
      
      This has been tested and found to fix the RCU softlockup messages.
      
      Testing was also done with tracing to measure requested versus
      achieved wakeup latencies for all non-deferrable timers in an idle
      system (with no lockup watchdogs running). Wakeup latency relative to
      absolute latency is calculated (note this suffers from round-up skew
      at low absolute times) and analysed:
      
                   max     avg      std
      upstream   506.0    1.20     4.68
      patched      2.0    1.08     0.15
      
      The bug was noticed due to the lockup detector Kconfig changes
      dropping it out of people's .configs and resulting in larger base
      clk skew. When the lockup detectors are enabled, no CPU can go idle for
      longer than 4 seconds, which limits the granularity errors.
      Sub-optimal timer behaviour is observable on a smaller scale in that
      case:
      
                   max     avg      std
      upstream     9.0    1.05     0.19
      patched      2.0    1.04     0.11
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: dzickus@redhat.com
      Cc: sfr@canb.auug.org.au
      Cc: mpe@ellerman.id.au
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: linuxarm@huawei.com
      Cc: abdhalee@linux.vnet.ibm.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: akpm@linux-foundation.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • perf/core: Fix group {cpu,task} validation · 2c0dc7f0
      Mark Rutland authored
      commit 64aee2a9 upstream.
      
      Regardless of which events form a group, it does not make sense for the
      events to target different tasks and/or CPUs, as this leaves the group
      inconsistent and impossible to schedule. The core perf code assumes that
      these are consistent across (successfully initialised) groups.
      
      Core perf code only verifies this when moving SW events into a HW
      context. Thus, we can violate this requirement for pure SW groups and
      pure HW groups, unless the relevant PMU driver happens to perform this
      verification itself. These mismatched groups subsequently wreak havoc
      elsewhere.
      
      For example, we handle watchpoints as SW events, and reserve watchpoint
      HW on a per-CPU basis at pmu::event_init() time to ensure that any event
      that is initialised is guaranteed to have a slot at pmu::add() time.
      However, the core code only checks the group leader's cpu filter (via
      event_filter_match()), and can thus install follower events onto CPUs
      violating their (mismatched) CPU filters, potentially installing them
      into a CPU without sufficient reserved slots.
      
      This can be triggered with the below test case, resulting in warnings
      from arch backends.
      
        #define _GNU_SOURCE
        #include <linux/hw_breakpoint.h>
        #include <linux/perf_event.h>
        #include <sched.h>
        #include <stdio.h>
        #include <sys/prctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
      
        static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
      			   int group_fd, unsigned long flags)
        {
      	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
        }
      
        char watched_char;
      
        struct perf_event_attr wp_attr = {
      	.type = PERF_TYPE_BREAKPOINT,
      	.bp_type = HW_BREAKPOINT_RW,
      	.bp_addr = (unsigned long)&watched_char,
      	.bp_len = 1,
      	.size = sizeof(wp_attr),
        };
      
        int main(int argc, char *argv[])
        {
      	int leader, ret;
      	cpu_set_t cpus;
      
      	/*
      	 * Force use of CPU0 to ensure our CPU0-bound events get scheduled.
      	 */
      	CPU_ZERO(&cpus);
      	CPU_SET(0, &cpus);
      	ret = sched_setaffinity(0, sizeof(cpus), &cpus);
      	if (ret) {
      		printf("Unable to set cpu affinity\n");
      		return 1;
      	}
      
      	/* open leader event, bound to this task, CPU0 only */
      	leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
      	if (leader < 0) {
      		printf("Couldn't open leader: %d\n", leader);
      		return 1;
      	}
      
      	/*
      	 * Open a follower event that is bound to the same task, but a
      	 * different CPU. This means that the group should never be possible to
      	 * schedule.
      	 */
      	ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
      	if (ret < 0) {
      		printf("Couldn't open mismatched follower: %d\n", ret);
      		return 1;
      	} else {
      		printf("Opened leader/follower with mismatched CPUs\n");
      	}
      
      	/*
      	 * Open as many independent events as we can, all bound to the same
      	 * task, CPU0 only.
      	 */
      	do {
      		ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
      	} while (ret >= 0);
      
      	/*
      	 * Force enable/disable all events to trigger the erroneous
      	 * installation of the follower event.
      	 */
      	printf("Opened all events. Toggling..\n");
      	for (;;) {
      		prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
      		prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
      	}
      
      	return 0;
        }
      
      Fix this by validating this requirement regardless of whether we're
      moving events.
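
The requirement itself is small enough to state as a standalone check. The following is a sketch with hypothetical names (`event_sketch`, `validate_group_sketch()`), where integer identifiers stand in for the CPU and task that an event targets. It is not the core perf code.

```c
#include <errno.h>

/* Hypothetical, reduced event: which CPU and which task it targets. */
struct event_sketch {
    int cpu;    /* target CPU */
    int task;   /* target task id, or -1 for a per-CPU event */
};

/* A follower may join a group only if it targets the same CPU and the
 * same task as the leader, whatever kind of event it is. */
static int validate_group_sketch(const struct event_sketch *leader,
                                 const struct event_sketch *follower)
{
    if (leader->cpu != follower->cpu)
        return -EINVAL;
    if (leader->task != follower->task)
        return -EINVAL;
    return 0;
}
```

Applied to the test case above, the follower opened on CPU 1 against a leader on CPU 0 would be rejected at open time instead of being installed later.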

      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zhou Chengming <zhouchengming1@huawei.com>
      Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ftrace: Check for null ret_stack on profile function graph entry function · aa2da6c4
      Steven Rostedt (VMware) authored
      commit a8f0f9e4 upstream.
      
      There's a small race between function graph shutting down and the calling of the
      registered function graph entry callback. The callback must not reference
      the task's ret_stack without first checking that it is not NULL. Note, when
      a ret_stack is allocated for a task, it stays allocated until the task exits.
      The problem here is that function_graph is shut down, and a new task was
      created, which doesn't have its ret_stack allocated. But since some of the
      functions are still being traced, the callbacks can still be called.
      
      The normal function_graph code handles this, but starting with commit
      8861dd30 ("ftrace: Access ret_stack->subtime only in the function
      profiler") the profiler code references the ret_stack on function entry, but
      doesn't check if it is NULL first.
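
The missing guard can be pictured with a reduced model of the entry callback. This is a sketch only: `task_sketch`, `ret_entry_sketch`, `DEPTH_SKETCH` and `profile_entry_sketch()` are hypothetical names, and the return value merely signals whether the entry was handled.

```c
#include <stddef.h>

#define DEPTH_SKETCH 4   /* hypothetical return-stack depth */

struct ret_entry_sketch { unsigned long long subtime; };

/* Hypothetical task: ret_stack stays NULL if it was never allocated. */
struct task_sketch {
    struct ret_entry_sketch *ret_stack;
    int curr_ret_stack;
};

/* Entry callback: bail out before touching ret_stack if it is NULL. */
static int profile_entry_sketch(struct task_sketch *t)
{
    int index = t->curr_ret_stack;

    if (t->ret_stack == NULL)
        return 0;   /* task created after shutdown, nothing to reset */

    if (index >= 0 && index < DEPTH_SKETCH)
        t->ret_stack[index].subtime = 0;
    return 1;
}
```

A task created after shutdown takes the early return, while tasks that already own a return stack behave as before.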
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611
      
      Fixes: 8861dd30 ("ftrace: Access ret_stack->subtime only in the function profiler")
      Reported-by: lilydjwg@gmail.com
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • virtio_pci: fix cpu affinity support · 1b8ca885
      Christoph Hellwig authored
      commit ba74b6f7 upstream.
      
      Commit 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for
      virtqueues"") removed the adjustment of the pre_vectors for the virtio
      MSI-X vector allocation which was added in commit fb5e31d9 ("virtio:
      allow drivers to request IRQ affinity when creating VQs"). This will
      lead to an incorrect assignment of MSI-X vectors, and potential
      deadlocks when offlining cpus.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 0b0f9dc5 ("Revert "virtio_pci: use shared interrupts for virtqueues"")
      Reported-by: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU · 78f2e29f
      Steven Rostedt (VMware) authored
      commit a7e52ad7 upstream.
      
      Chunyu Hu reported:
        "per_cpu trace directories and files are created for all possible cpus,
         but only the cpus which have ever been on-lined have their own per cpu
         ring buffer (allocated by cpuhp threads). While trace_buffers_open, the
         open handler for trace file 'trace_pipe_raw' is always trying to access
         field of ring_buffer_per_cpu, and would panic with the NULL pointer.
      
         Align the behavior of trace_pipe_raw with trace_pipe, which returns -ENODEV
         when opening it if that cpu does not have a trace ring buffer.
      
         Reproduce:
         cat /sys/kernel/debug/tracing/per_cpu/cpu31/trace_pipe_raw
         (cpu31 is never on-lined, this is a 16 cores x86_64 box)
      
         Tested with:
         1) boot with maxcpus=14, read trace_pipe_raw of cpu15.
            Got -ENODEV.
         2) online cpu15, read trace_pipe_raw of cpu15.
            Get the raw trace data.
      
         Call trace:
         [ 5760.950995] RIP: 0010:ring_buffer_alloc_read_page+0x32/0xe0
         [ 5760.961678]  tracing_buffers_read+0x1f6/0x230
         [ 5760.962695]  __vfs_read+0x37/0x160
         [ 5760.963498]  ? __vfs_read+0x5/0x160
         [ 5760.964339]  ? security_file_permission+0x9d/0xc0
         [ 5760.965451]  ? __vfs_read+0x5/0x160
         [ 5760.966280]  vfs_read+0x8c/0x130
         [ 5760.967070]  SyS_read+0x55/0xc0
         [ 5760.967779]  do_syscall_64+0x67/0x150
         [ 5760.968687]  entry_SYSCALL64_slow_path+0x25/0x25"
      
      This was introduced by the addition of the feature to reuse reader pages
      instead of re-allocating them. The problem is that the allocation of a
      reader page (which is per cpu) does not check if the cpu is online and set
      up for the ring buffer.
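
The missing check can be modeled in user space. The sketch below is a simplified illustration, not the kernel patch: `ring_buffer_model`, `alloc_read_page_model`, and `NR_CPUS_MODEL` are hypothetical names, and the kernel function returns a pointer rather than an int. It only shows the idea of refusing to allocate a reader page for a CPU that has no per-cpu buffer set up.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS_MODEL 16 /* hypothetical number of possible CPUs */

/* Hypothetical model: which CPUs have a per-cpu ring buffer allocated. */
struct ring_buffer_model {
    bool cpu_has_buffer[NR_CPUS_MODEL];
};

/*
 * Allocate a reader page for `cpu`.
 * Returns 0 and stores the page in *page on success, -ENODEV if the CPU
 * has no ring buffer (never brought online), or -ENOMEM if the
 * allocation itself fails. *page is NULL on any error.
 */
int alloc_read_page_model(const struct ring_buffer_model *rb, int cpu,
                          void **page)
{
    *page = NULL;

    /* The check the original code lacked: is this CPU set up at all? */
    if (cpu < 0 || cpu >= NR_CPUS_MODEL || !rb->cpu_has_buffer[cpu])
        return -ENODEV;

    *page = calloc(1, 4096);
    return *page ? 0 : -ENOMEM;
}
```

A caller in this model checks the return value before touching the page, which is what lets the read path fail cleanly instead of dereferencing a NULL per-cpu structure.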
      
      Link: http://lkml.kernel.org/r/1500880866-1177-1-git-send-email-chuhu@redhat.com
      
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      78f2e29f
    • Chuck Lever's avatar
      nfsd: Limit end of page list when decoding NFSv4 WRITE · 8d4f126c
      Chuck Lever authored
      commit fc788f64 upstream.
      
      When processing an NFSv4 WRITE operation, argp->end should never
      point past the end of the data in the final page of the page list.
      Otherwise, nfsd4_decode_compound can walk into uninitialized memory.
      
      More critically, nfsd4_decode_write is failing to increment argp->pagelen
      when it increments argp->pagelist.  This can cause later xdr decoders
      to assume more data is available than really is, which can cause server
      crashes on malformed requests.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d4f126c
    • Ronnie Sahlberg's avatar
      cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup() · ea5745a5
      Ronnie Sahlberg authored
      commit d3edede2 upstream.
      
      Add checking for the path component length and verify it is <= the maximum
      that the server advertises via FileFsAttributeInformation.
      
      With this patch cifs.ko will now return ENAMETOOLONG instead of ENOENT
      when users try to access an overlong path.
      
      To test this, try to cd into a (non-existing) directory on a CIFS share
      that has a too long name:
      cd /mnt/aaaaaaaaaaaaaaa...
      
      and it now should show a good error message from the shell:
      bash: cd: /mnt/aaaaaaaaaaaaaaaa...aaaaaa: File name too long
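
The per-component length check described above can be sketched in plain C. This is an illustration under assumptions, not the cifs.ko code: `check_path_components` is a hypothetical helper, and `max_len` stands in for the maximum component length the server advertises.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Return 0 if every component of `path` is at most `max_len` bytes long,
 * or -ENAMETOOLONG as soon as one component exceeds it.
 * `sep` is the path separator; empty components are accepted.
 */
int check_path_components(const char *path, size_t max_len, char sep)
{
    size_t len = 0; /* length of the component currently being scanned */

    for (const char *p = path; ; p++) {
        if (*p == sep || *p == '\0') {
            if (len > max_len)
                return -ENAMETOOLONG;
            len = 0;
            if (*p == '\0')
                break;
        } else {
            len++;
        }
    }
    return 0;
}
```

Running the check before the lookup is sent is what allows the more specific error to reach the shell instead of a generic "no such file".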
      
      rh bz 1153996
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea5745a5
    • Sachin Prabhu's avatar
      cifs: Fix df output for users with quota limits · 1bc1c439
      Sachin Prabhu authored
      commit 42bec214 upstream.
      
      The df for a SMB2 share triggers a GetInfo call for
      FS_FULL_SIZE_INFORMATION. The values returned are used to populate
      struct statfs.
      
      The problem is that none of the information returned by the call
      contains the total blocks available on the filesystem. Instead we use
      the blocks available to the user, i.e. the quota limitation, when filling
      out statfs.f_blocks. The information returned does contain the actual
      free units on the filesystem, which are used to populate statfs.f_bfree.
      For users with quota enabled, this can lead to situations where the total
      free space reported is more than the total blocks on the system, ending
      up with df reports like the following:
      
       # df -h /mnt/a
      Filesystem         Size  Used Avail Use% Mounted on
      //192.168.22.10/a  2.5G -2.3G  2.5G    - /mnt/a
      
      To fix this problem, we instead populate statfs.f_bfree with the
      same value as statfs.f_bavail, i.e. CallerAvailableAllocationUnits. This
      is similar to what is already done in the code for cifs, and df now
      reports the quota information for the user used to mount the share.
      
       # df --si /mnt/a
      Filesystem         Size  Used Avail Use% Mounted on
      //192.168.22.10/a  2.7G  101M  2.6G   4% /mnt/a
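
The mapping described above can be sketched as follows. The struct and function names are hypothetical stand-ins chosen for this illustration; only the idea of filling both free-space fields from the caller-available count comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical model of the values returned by the GetInfo call. */
struct full_size_info_model {
    uint64_t total_units;            /* units available to the caller (quota) */
    uint64_t caller_available_units; /* free units for the caller */
    uint64_t actual_available_units; /* free units on the whole volume */
};

/* Hypothetical subset of the statfs fields that df consumes. */
struct statfs_model {
    uint64_t f_blocks;
    uint64_t f_bfree;
    uint64_t f_bavail;
};

void fill_statfs_model(const struct full_size_info_model *info,
                       struct statfs_model *st)
{
    st->f_blocks = info->total_units;
    /*
     * Using actual_available_units for f_bfree could exceed f_blocks
     * under a quota; use the caller-available count for both fields.
     */
    st->f_bfree = info->caller_available_units;
    st->f_bavail = info->caller_available_units;
}
```

With both fields taken from the same quota-scoped view, "used" (f_blocks minus f_bfree) can no longer go negative in this model.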
      Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: Pierguido Lambri <plambri@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1bc1c439
    • Nicholas Piggin's avatar
      kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured · 3b278d7e
      Nicholas Piggin authored
      commit cb87481e upstream.
      
      The .data and .bss sections were modified in the generic linker script to
      pull in sections named .data.<C identifier>, which are generated by gcc with
      -ffunction-sections and -fdata-sections options.
      
      The problem with this pattern is it can also match section names that Linux
      defines explicitly, e.g., .data.unlikely. This can cause Linux sections to
      get moved into the wrong place.
      
      The way to avoid this is to use ".." separators for explicit section names
      (the dot character is valid in a section name but not a C identifier).
      However currently there are sections which don't follow this rule, so for
      now just disable the wild card by default.
      
      Example: http://marc.info/?l=linux-arm-kernel&m=150106824024221&w=2
      
      Fixes: b67067f1 ("kbuild: allow archs to select link dead code/data elimination")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b278d7e
    • Bharat Potnuri's avatar
      RDMA/uverbs: Initialize cq_context appropriately · 51f49383
      Bharat Potnuri authored
      commit 65159c05 upstream.
      
      Initializing cq_context with ev_queue in create_cq() leads to a NULL pointer
      dereference in ib_uverbs_comp_handler() if the application does not use a
      completion channel. This patch fixes the cq_context initialization.
      
      Fixes: 1e7710f3 ("IB/core: Change completion channel to use the reworked")
      Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Reviewed-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      (cherry picked from commit 699a2d5b)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      51f49383
    • Steven Rostedt (VMware)'s avatar
      tracing: Fix freeing of filter in create_filter() when set_str is false · 53a38dfb
      Steven Rostedt (VMware) authored
      commit 8b0db1a5 upstream.
      
      Performing the following task with kmemleak enabled:
      
       # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
       # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
       # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
       # echo scan > /sys/kernel/debug/kmemleak
       # cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800b9290308 (size 32):
        comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
          [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
          [<ffffffff812639c9>] create_filter+0xa9/0x160
          [<ffffffff81263bdc>] create_event_filter+0xc/0x10
          [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
          [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
          [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
          [<ffffffff8138cf87>] __vfs_write+0xd7/0x380
          [<ffffffff8138f421>] vfs_write+0x101/0x260
          [<ffffffff8139187b>] SyS_write+0xab/0x130
          [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The function create_filter() is passed a 'filterp' pointer that gets
      allocated, and if "set_str" is true, it is up to the caller to free it, even
      on error. The problem is that the pointer is not freed by create_filter()
      when set_str is false. This is a bug, and it is not up to the caller to free
      the filter on error if it doesn't care about the string.
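
The ownership rule can be modeled in a few lines of C. This is a sketch under assumptions, not the tracing code: `filter_model` and `create_filter_model` are hypothetical, and `parse_ok` simply simulates whether parsing the filter string succeeded.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the allocated filter object. */
struct filter_model {
    char *filter_string;
};

/*
 * On success, *filterp owns a new filter.
 * On error with set_str == true, *filterp still holds the filter (it would
 * carry the error string) and the caller must free it.
 * On error with set_str == false, the filter is freed here and *filterp is
 * NULL, so the caller has nothing to clean up.
 */
int create_filter_model(bool parse_ok, bool set_str,
                        struct filter_model **filterp)
{
    struct filter_model *filter = calloc(1, sizeof(*filter));

    if (!filter) {
        *filterp = NULL;
        return -ENOMEM;
    }

    if (!parse_ok && !set_str) {
        /* The step that was missing: nobody else will free this. */
        free(filter);
        filter = NULL;
    }

    *filterp = filter;
    return parse_ok ? 0 : -EINVAL;
}
```

The design point is that a caller passing `set_str == false` has no reason to inspect the filter after a failure, so leaving the allocation with that caller guarantees a leak.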
      
      Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
      
      Fixes: 38b78eb8 ("tracing: Factorize filter creation")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Tested-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      53a38dfb
    • Chunyu Hu's avatar
      tracing: Fix kmemleak in tracing_map_array_free() · 983ba814
      Chunyu Hu authored
      commit 475bb3c6 upstream.
      
      kmemleak reported the leak below when I was clearing the hist
      trigger. With this patch, the kmemleak report is gone.
      
      unreferenced object 0xffff94322b63d760 (size 32):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00  ................
          10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff  ..........z.1...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
          [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff9431f27aa880 (size 128):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff  ...*2......*2...
          00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff  ...*2......*2...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e425348>] __kmalloc+0xe8/0x220
          [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com
      
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      983ba814
    • Dan Carpenter's avatar
      tracing: Missing error code in tracer_alloc_buffers() · a23e7828
      Dan Carpenter authored
      commit 147d88e0 upstream.
      
      If ring_buffer_alloc() or one of the next couple of function calls fails
      then we should return -ENOMEM but the current code returns success.
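
A common way to avoid this class of bug is to set the error code before the fallible steps and only overwrite it on success. The sketch below illustrates that pattern in user space; `alloc_step` and `init_buffers_model` are hypothetical names, and the `fail_*` flags only simulate allocation failures.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical allocation step that can be forced to fail. */
static void *alloc_step(int should_fail)
{
    return should_fail ? NULL : malloc(16);
}

/*
 * Run two dependent allocation steps. `ret` is set to -ENOMEM up front, so
 * every early exit through the cleanup labels reports the failure rather
 * than a stale 0 left over from earlier code.
 */
int init_buffers_model(int fail_first, int fail_second)
{
    int ret = -ENOMEM;
    void *a, *b;

    a = alloc_step(fail_first);
    if (!a)
        goto out;

    b = alloc_step(fail_second);
    if (!b)
        goto out_free_a;

    ret = 0; /* only reached when every step succeeded */

    /* This demo releases everything on success as well. */
    free(b);
out_free_a:
    free(a);
out:
    return ret;
}
```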
      
      Link: http://lkml.kernel.org/r/20170801110201.ajdkct7vwzixahvx@mwanda
      
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Fixes: b32614c0 ('tracing/rb: Convert to hotplug state machine')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a23e7828
    • Steven Rostedt (VMware)'s avatar
      tracing: Call clear_boot_tracer() at lateinit_sync · 3888c3ae
      Steven Rostedt (VMware) authored
      commit 4bb0f0e7 upstream.
      
      The clear_boot_tracer function is used to reset the default_bootup_tracer
      string to prevent it from being accessed after boot, as it originally points
      to init data. But since clear_boot_tracer() is called via the
      init_lateinit() call, it races with the initcall for registering the hwlat
      tracer. If someone adds "ftrace=hwlat" to the kernel command line, depending
      on how the linker sets up the text, the saved command line may be cleared,
      and the hwlat tracer never is initialized.
      
      Simply have the clear_boot_tracer() be called by initcall_lateinit_sync() as
      that's for tasks to be called after lateinit.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196551
      
      Fixes: e7c15cd8 ("tracing: Added hardware latency tracer")
      Reported-by: Zamir SUN <sztsian@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3888c3ae
    • Sakari Ailus's avatar
      ACPI: device property: Fix node lookup in acpi_graph_get_child_prop_value() · 1344db83
      Sakari Ailus authored
      commit b5212f57 upstream.
      
      acpi_graph_get_child_prop_value() is intended to find a child node with a
      certain property value pair. The check
      
      	if (!fwnode_property_read_u32(fwnode, prop_name, &nr))
      		continue;
      
      is faulty: fwnode_property_read_u32() returns zero on success, not on
      failure, leading to comparing values only if the searched property was not
      found.
      
      Moreover, the check is made against the parent device node instead of
      the child one as it should be.
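
Both problems can be seen in a small user-space model. The types and helpers below (`node_model`, `read_u32_model`, `find_child_model`) are hypothetical; the only convention borrowed from the commit message is that the read helper returns zero on success.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node that may or may not carry the searched property. */
struct node_model {
    bool has_prop;
    uint32_t value;
};

/* Mirrors the convention in the text: 0 on success, negative on failure. */
static int read_u32_model(const struct node_model *node, uint32_t *out)
{
    if (!node->has_prop)
        return -EINVAL;
    *out = node->value;
    return 0;
}

/* Return the index of the first child whose property equals `wanted`, or -1. */
int find_child_model(const struct node_model *children, size_t count,
                     uint32_t wanted)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t nr;

        /* Skip on failure (non-zero), and read from the child, not the parent. */
        if (read_u32_model(&children[i], &nr))
            continue;

        if (nr == wanted)
            return (int)i;
    }
    return -1;
}
```

With the inverted test (`!read_u32_model(...)`), the loop would skip every child that has the property and compare an uninitialized `nr` for the ones that do not.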
      
      Fixes: 79389a83 (ACPI / property: Add support for remote endpoints)
      Reported-by: Hyungwoo Yang <hyungwoo.yang@intel.com>
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      [ rjw: Changelog ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1344db83
    • Alex Deucher's avatar
      Revert "drm/amdgpu: fix vblank_time when displays are off" · dbe5b2d7
      Alex Deucher authored
      This reverts commit 2dc1889e.
      
      Fixes a suspend and resume regression.
      
      bug: https://bugzilla.kernel.org/show_bug.cgi?id=196615
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbe5b2d7
    • fred gao's avatar
      drm/i915/gvt: Fix the kernel null pointer error · 4ac9a5da
      fred gao authored
      commit ffeaf9aa upstream.
      
      Once an error happens in the shadow_indirect_ctx function, the variable
      wa_ctx->indirect_ctx.obj is accessed without having been initialized, so
      a kernel NULL pointer panic occurs.
      
      Fixes: 894cf7d1 ("drm/i915/gvt: i915_gem_object_create() returns an error pointer")
      Signed-off-by: fred gao <fred.gao@intel.com>
      Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ac9a5da
    • Jani Nikula's avatar
      drm/i915/vbt: ignore extraneous child devices for a port · bbb04b37
      Jani Nikula authored
      commit 7c648bde upstream.
      
      Ever since we've parsed VBT child devices, starting from 6acab15a
      ("drm/i915: use the HDMI DDI buffer translations from VBT"), we've
      ignored the child device information if more than one child device
      references the same port. The rationale for this seems lost in time.
      
      Since commit 311a2094 ("drm/i915: don't init DP or HDMI when not
      supported by DDI port") we started using this information more to skip
      HDMI/DP init if the port wasn't there per VBT child devices. However, at
      the same time it added port defaults without further explanation.
      
      Thus, if the child device info was skipped due to multiple child devices
      referencing the same port, the device info would be retrieved from the
      somewhat arbitrary defaults.
      
      Finally, when commit bb1d1329 ("drm/i915/vbt: split out defaults
      that are set when there is no VBT") stopped initializing the defaults
      whenever VBT is present, thus trusting the VBT more, we stopped
      initializing ports which were referenced by more than one child device.
      
      Apparently at least Asus UX305UA, UX305U, and UX306U laptops have VBT
      child device blocks which cause this behaviour. Arguably they were
      shipped with a broken VBT.
      
      Relax the rules for multiple references to the same port, and use the
      first child device info to reference a port. Retain the logic to debug
      log about this, though.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101745
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196233
      Fixes: bb1d1329 ("drm/i915/vbt: split out defaults that are set when there is no VBT")
      Tested-by: Oliver Weißbarth <mail@oweissbarth.de>
      Reported-by: Oliver Weißbarth <mail@oweissbarth.de>
      Reported-by: Didier G <didierg-divers@orange.fr>
      Reported-by: Giles Anderson <agander@gmail.com>
      Cc: Manasi Navare <manasi.d.navare@intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
      Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Jani Nikula <jani.nikula@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170811113907.6716-1-jani.nikula@intel.com
      Signed-off-by: Jani Nikula <jani.nikula@intel.com>
      (cherry picked from commit b5273d72)
      Signed-off-by: Jani Nikula <jani.nikula@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbb04b37
    • Maarten Lankhorst's avatar
      drm/atomic: If the atomic check fails, return its value first · d76df456
      Maarten Lankhorst authored
      commit a0ffc51e upstream.
      
      The last part of drm_atomic_check_only is testing whether we need to
      fail with -EINVAL when modeset is not allowed, but forgets to return
      the value when atomic_check() fails first.
      
      This results in -EDEADLK being replaced by -EINVAL, and the sanity
      check in drm_modeset_drop_locks kicks in:
      
      [  308.531734] ------------[ cut here ]------------
      [  308.531791] WARNING: CPU: 0 PID: 1886 at drivers/gpu/drm/drm_modeset_lock.c:217 drm_modeset_drop_locks+0x33/0xc0 [drm]
      [  308.531828] Modules linked in:
      [  308.532050] CPU: 0 PID: 1886 Comm: kms_atomic Tainted: G     U  W 4.13.0-rc5-patser+ #5225
      [  308.532082] Hardware name: NUC5i7RYB, BIOS RYBDWi35.86A.0246.2015.0309.1355 03/09/2015
      [  308.532124] task: ffff8800cd9dae00 task.stack: ffff8800ca3b8000
      [  308.532168] RIP: 0010:drm_modeset_drop_locks+0x33/0xc0 [drm]
      [  308.532189] RSP: 0018:ffff8800ca3bf980 EFLAGS: 00010282
      [  308.532211] RAX: dffffc0000000000 RBX: ffff8800ca3bfaf8 RCX: 0000000013a171e6
      [  308.532235] RDX: 1ffff10019477f69 RSI: ffffffffa8ba4fa0 RDI: ffff8800ca3bfb48
      [  308.532258] RBP: ffff8800ca3bf998 R08: 0000000000000000 R09: 0000000000000003
      [  308.532281] R10: 0000000079dbe066 R11: 00000000f760b34b R12: 0000000000000001
      [  308.532304] R13: dffffc0000000000 R14: 00000000ffffffea R15: ffff880096889680
      [  308.532328] FS:  00007ff00959cec0(0000) GS:ffff8800d4e00000(0000) knlGS:0000000000000000
      [  308.532359] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  308.532380] CR2: 0000000000000008 CR3: 00000000ca2e3000 CR4: 00000000003406f0
      [  308.532402] Call Trace:
      [  308.532440]  drm_mode_atomic_ioctl+0x19fa/0x1c00 [drm]
      [  308.532488]  ? drm_atomic_set_property+0x1220/0x1220 [drm]
      [  308.532565]  ? avc_has_extended_perms+0xc39/0xff0
      [  308.532593]  ? lock_downgrade+0x610/0x610
      [  308.532640]  ? drm_atomic_set_property+0x1220/0x1220 [drm]
      [  308.532680]  drm_ioctl_kernel+0x154/0x1a0 [drm]
      [  308.532755]  drm_ioctl+0x624/0x8f0 [drm]
      [  308.532858]  ? drm_atomic_set_property+0x1220/0x1220 [drm]
      [  308.532976]  ? drm_getunique+0x210/0x210 [drm]
      [  308.533061]  do_vfs_ioctl+0xd92/0xe40
      [  308.533121]  ? ioctl_preallocate+0x1b0/0x1b0
      [  308.533160]  ? selinux_capable+0x20/0x20
      [  308.533191]  ? do_fcntl+0x1b1/0xbf0
      [  308.533219]  ? kasan_slab_free+0xa2/0xb0
      [  308.533249]  ? f_getown+0x4b/0xa0
      [  308.533278]  ? putname+0xcf/0xe0
      [  308.533309]  ? security_file_ioctl+0x57/0x90
      [  308.533342]  SyS_ioctl+0x4e/0x80
      [  308.533374]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  308.533405] RIP: 0033:0x7ff00779e4d7
      [  308.533431] RSP: 002b:00007fff66a043d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [  308.533481] RAX: ffffffffffffffda RBX: 000000e7c7ca5910 RCX: 00007ff00779e4d7
      [  308.533560] RDX: 00007fff66a04430 RSI: 00000000c03864bc RDI: 0000000000000003
      [  308.533608] RBP: 00007ff007a5fb00 R08: 000000e7c7ca4620 R09: 000000e7c7ca5e60
      [  308.533647] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000070
      [  308.533685] R13: 0000000000000000 R14: 0000000000000000 R15: 000000e7c7ca5930
      [  308.533770] Code: ff df 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 c7
      50 48 89 fa 48 c1 ea 03 80 3c 02 00 74 05 e8 94 d4 16 e7 48 83 7b 50 00
      74 02 <0f> ff 4c 8d 6b 58 48 b8 00 00 00 00 00 fc ff df 4c 89 ea 48 c1
      [  308.534086] ---[ end trace 77f11e53b1df44ad ]---
      
      Solve this by adding the missing return.
      
      This is also a bugfix because we could end up rejecting updates with
      -EINVAL because of an early -EDEADLK, while if atomic_check ran to
      completion it might have downgraded the modeset to a fastset.
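
The control flow of the fix can be sketched as below. This is a model with hypothetical names (`atomic_check_only_model` and its parameters), not the DRM function: `driver_check_ret` simulates the value returned by the driver's atomic_check step.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Return the driver check's error first; only when it succeeded do we
 * evaluate whether a modeset is required but not allowed.
 */
int atomic_check_only_model(int driver_check_ret, bool requires_modeset,
                            bool allow_modeset)
{
    int ret = driver_check_ret;

    /* The missing return: propagate e.g. -EDEADLK unchanged. */
    if (ret)
        return ret;

    if (requires_modeset && !allow_modeset)
        return -EINVAL;

    return 0;
}
```

Propagating -EDEADLK unchanged matters because the caller treats it as "back off and retry", whereas -EINVAL is a final rejection.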
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Testcase: kms_atomic
      Link: https://patchwork.freedesktop.org/patch/msgid/20170815095706.23624-1-maarten.lankhorst@linux.intel.com
      Fixes: d34f20d6 ("drm: Atomic modeset ioctl")
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d76df456
    • Maarten Lankhorst's avatar
      drm/atomic: Handle -EDEADLK with out-fences correctly · 247122f1
      Maarten Lankhorst authored
      commit 7f5d6dac upstream.
      
      complete_crtc_signaling is freeing fence_state, but when retrying
      num_fences and fence_state are not zeroed. This caused duplicate
      fds in the fence_state array, followed by a BUG_ON in fs/file.c
      because we reallocate freed memory and install over an existing
      fd, or potentially other fun.
      
      Zero fence_state and num_fences correctly in the retry loop, which
      allows kms_atomic_transition to pass.
      
      Fixes: beaf5af4 ("drm/fence: add out-fences support")
      Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
      Cc: Brian Starkey <brian.starkey@arm.com> (v10)
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Testcase: kms_atomic_transitions.plane-all-modeset-transition-fencing
      (with CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y)
      Link: https://patchwork.freedesktop.org/patch/msgid/20170814100721.13340-1-maarten.lankhorst@linux.intel.com
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> #intel-gfx on irc
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      247122f1
    • Jonathan Liu's avatar
      drm/sun4i: Implement drm_driver lastclose to restore fbdev console · d4ae641c
      Jonathan Liu authored
      commit 2a596fc9 upstream.
      
      The drm_driver lastclose callback is called when the last userspace
      DRM client has closed. Call drm_fbdev_cma_restore_mode to restore
      the fbdev console; otherwise the fbdev console will stop working.
      
      Fixes: 9026e0d1 ("drm: Add Allwinner A10 Display Engine support")
      Tested-by: Olliver Schinagl <oliver@schinagl.nl>
      Reviewed-by: Chen-Yu Tsai <wens@csie.org>
      Signed-off-by: Jonathan Liu <net147@gmail.com>
      Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4ae641c
    • Chris Wilson's avatar
      drm: Release driver tracking before making the object available again · 08353913
      Chris Wilson authored
      commit fe4600a5 upstream.
      
      This is the same bug as we fixed in commit f6cd7dae ("drm: Release
      driver references to handle before making it available again"), but now
      the exposure is via the PRIME lookup tables. If we remove the
      object/handle from the PRIME lut, then a new request for the same
      object/fd will generate a new handle, thus for a short window that
      object is known to userspace by two different handles. Fix this by
      releasing the driver tracking before PRIME.
      
      Fixes: 0ff926c7 ("drm/prime: add exported buffers to current fprivs
      imported buffer list (v2)")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170819120558.6465-1-chris@chris-wilson.co.uk
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      08353913
    • Nikhil Mahale's avatar
      drm: Fix framebuffer leak · b96c1565
      Nikhil Mahale authored
      commit 491ab470 upstream.
      
      Do not leak the framebuffer if the client-provided crtc id is found to be
      invalid.
      Signed-off-by: Nikhil Mahale <nmahale@nvidia.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/1502250781-5779-1-git-send-email-nmahale@nvidia.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b96c1565
    • Dave Martin's avatar
      arm64: fpsimd: Prevent registers leaking across exec · 865d89f8
      Dave Martin authored
      commit 09662210 upstream.
      
      There are some tricky dependencies between the different stages of
      flushing the FPSIMD register state during exec, and these can race
      with context switch in ways that can cause the old task's regs to
      leak across.  In particular, a context switch during the memset() can
      cause some of the task's old FPSIMD registers to reappear.
      
      Disabling preemption for this small window would be no big deal for
      performance: preemption is already disabled for similar scenarios
      like updating the FPSIMD registers in sigreturn.
      
      So, instead of rearranging things in ways that might swap existing
      subtle bugs for new ones, this patch just disables preemption
      around the FPSIMD state flushing so that races of this type can't
      occur here.  This brings fpsimd_flush_thread() into line with other
      code paths.
      
      Fixes: 674c242c ("arm64: flush FP/SIMD state correctly after execve()")
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      865d89f8
    • Pavel Tatashin's avatar
      mm/memblock.c: reversed logic in memblock_discard() · 1c229d7a
      Pavel Tatashin authored
      commit 91b540f9 upstream.
      
      The recently introduced memblock_discard() has a reversed-logic bug:
      it frees the static array instead of the dynamically allocated one.
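
The intended condition can be illustrated with a user-space model. The names below (`init_regions_model`, `region_type_model`, `discard_regions_model`) are hypothetical; the sketch only shows that the free must happen when the array is *not* the built-in static one.

```c
#include <stdbool.h>
#include <stdlib.h>

#define INIT_REGIONS_MODEL 4

/* Hypothetical static array used until a larger one is allocated. */
static int init_regions_model[INIT_REGIONS_MODEL];

struct region_type_model {
    int *regions; /* points at the static array or at a heap array */
};

/*
 * Free the region array only if it was dynamically allocated.
 * Returns true if memory was released, false if the static array was in use.
 */
bool discard_regions_model(struct region_type_model *type)
{
    /* The reversed version of this test would free the static array. */
    if (type->regions == init_regions_model)
        return false;

    free(type->regions);
    type->regions = NULL;
    return true;
}
```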
      
      Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 3010f876 ("mm: discard memblock data later")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reported-by: Woody Suwalski <terraluna977@gmail.com>
      Tested-by: Woody Suwalski <terraluna977@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c229d7a
    • Eric Biggers's avatar
      fork: fix incorrect fput of ->exe_file causing use-after-free · f5024bb3
      Eric Biggers authored
      commit 2b7e8665 upstream.
      
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
      However, it was overlooked that this introduced a new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      taken.
      
      This caused the struct file to later be freed prematurely.
      
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
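
The reference-count imbalance and its fix can be modeled with plain structs. Everything below is a hypothetical simplification (`file_model`, `mm_model`, and the `*_model` functions are not kernel APIs); `killed` simulates the fatal signal arriving while waiting for the lock.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct file_model {
    int refcount;
};

struct mm_model {
    struct file_model *exe_file;
};

/* The fix: initialization clears the pointer copied from the parent. */
static void mm_init_model(struct mm_model *mm)
{
    mm->exe_file = NULL;
}

/* Teardown drops a reference only if one is recorded. */
static void mmput_model(struct mm_model *mm)
{
    if (mm->exe_file)
        mm->exe_file->refcount--;
}

int dup_mm_model(const struct mm_model *oldmm, struct mm_model *mm,
                 bool killed)
{
    *mm = *oldmm;      /* struct copy: exe_file pointer comes along */
    mm_init_model(mm); /* without this, the error path below would
                          drop a reference that was never taken */

    if (killed) {
        mmput_model(mm);
        return -EINTR;
    }

    /* Normal path: take a real reference for the new mm. */
    mm->exe_file = oldmm->exe_file;
    if (mm->exe_file)
        mm->exe_file->refcount++;
    return 0;
}
```

In this model the parent's reference count is unchanged after a killed fork, which is the invariant the premature free violated.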
      
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *mmap_thread(void *_arg)
          {
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
              }
          }
      
          static void *fork_thread(void *_arg)
          {
              usleep(rand() % 10000);
              fork();
          }
      
          int main(void)
          {
              fork();
              fork();
              fork();
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      
      Google Bug Id: 64772007
      
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5024bb3
    • Eric Biggers's avatar
      mm/madvise.c: fix freeing of locked page with MADV_FREE · 4823f463
      Eric Biggers authored
      commit 263630e8 upstream.
      
      If madvise(..., MADV_FREE) split a transparent hugepage, it called
      put_page() before unlock_page().
      
      This was wrong because put_page() can free the page, e.g. if a
      concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
      mapping. put_page() then rightfully complained about freeing a locked
      page.
      
      Fix this by moving the unlock_page() before put_page().
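
      The ordering problem can be illustrated with a small userspace model.
      This is a sketch under stated assumptions: struct toy_page and the toy_*
      helpers are hypothetical and only mimic the "freed while still locked"
      check, not the real page allocator.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
    int  refcount;
    bool locked;
    bool freed;
    bool bad_state;   /* set if the page was freed while still locked */
};

static void toy_unlock_page(struct toy_page *p) { p->locked = false; }

/* Drop one reference; the last reference frees the page. */
static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0) {
        p->freed = true;
        if (p->locked)
            p->bad_state = true;   /* models the "Bad page state" report */
    }
}

/* Old order: the put may free a page that is still locked. */
static void toy_release_buggy(struct toy_page *p)
{
    toy_put_page(p);
    toy_unlock_page(p);   /* in a real kernel this would touch a freed page */
}

/* Fixed order: unlock first, then drop the reference. */
static void toy_release_fixed(struct toy_page *p)
{
    toy_unlock_page(p);
    toy_put_page(p);
}
```

      The bad state only shows up when the caller holds the last reference,
      which is the situation a concurrent MADV_DONTNEED creates by removing
      the page from the mapping.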
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: Bad page state in process syzkaller412798  pfn:1bd800
          page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
          flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
          raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
          raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
          page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
          bad because of flags: 0x1(locked)
          Modules linked in:
          CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           bad_page+0x230/0x2b0 mm/page_alloc.c:565
           free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
           free_pages_check mm/page_alloc.c:952 [inline]
           free_pages_prepare mm/page_alloc.c:1043 [inline]
           free_pcp_prepare mm/page_alloc.c:1068 [inline]
           free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
           __put_single_page mm/swap.c:79 [inline]
           __put_page+0xfb/0x160 mm/swap.c:113
           put_page include/linux/mm.h:814 [inline]
           madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
           walk_pmd_range mm/pagewalk.c:50 [inline]
           walk_pud_range mm/pagewalk.c:108 [inline]
           walk_p4d_range mm/pagewalk.c:134 [inline]
           walk_pgd_range mm/pagewalk.c:160 [inline]
           __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
           walk_page_range+0x200/0x470 mm/pagewalk.c:326
           madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
           madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
           madvise_dontneed_free mm/madvise.c:555 [inline]
           madvise_vma mm/madvise.c:664 [inline]
           SYSC_madvise mm/madvise.c:832 [inline]
           SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sys/mman.h>
          #include <unistd.h>
      
          #define MADV_FREE	8
          #define PAGE_SIZE	4096
      
          static void *mapping;
          static const size_t mapping_size = 0x1000000;
      
          static void *madvise_thrproc(void *arg)
          {
              madvise(mapping, mapping_size, (long)arg);
          }
      
          int main(void)
          {
              pthread_t t[2];
      
              for (;;) {
                  mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      
                  munmap(mapping + mapping_size / 2, PAGE_SIZE);
      
                  pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                  pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                  pthread_join(t[0], NULL);
                  pthread_join(t[1], NULL);
                  munmap(mapping, mapping_size);
              }
          }
      
      Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
      CONFIG_DEBUG_VM=y are needed.
      
      Google Bug Id: 64696096
      
      Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
      Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4823f463
    • Ulf Hansson's avatar
      i2c: designware: Fix system suspend · c237efed
      Ulf Hansson authored
      commit a23318fe upstream.
      
      The commit 8503ff16 ("i2c: designware: Avoid unnecessary resuming
      during system suspend") may suggest to the PM core to try out the
      so-called direct_complete path for system sleep. In this path, the PM
      core treats a runtime suspended device as if it's already in a proper
      low power state for system sleep, which makes it skip calling the system
      sleep callbacks for the device, except for the ->prepare() and the
      ->complete() callbacks.
      
      However, the PM core may unset the direct_complete flag for a parent
      device if one of its child devices was system suspended before it. In
      this scenario, the PM core invokes the system sleep callbacks, no matter
      if the device is runtime suspended or not.
      
      Particularly in cases of an existing i2c slave device, the above path is
      triggered, which breaks the assumption that the i2c device is always
      runtime resumed whenever the dw_i2c_plat_suspend() is being called.
      
      More precisely, dw_i2c_plat_suspend() calls clk_core_disable() and
      clk_core_unprepare() for an already disabled/unprepared clock, leading to
      a splat in the log about clock calls being wrongly balanced and breaking
      system sleep.
      
      To still allow the direct_complete path in cases when it's possible, but
      also to keep the fix simple, let's runtime resume the i2c device in the
      ->suspend() callback, before continuing to put the device into low power
      state.
      
      Note, in cases when the i2c device is attached to the ACPI PM domain, this
      problem doesn't occur, because ACPI's ->suspend() callback, assigned to
      acpi_subsys_suspend(), already calls pm_runtime_resume() for the device.
      
      It should also be noted that this change does not fix commit 8503ff16
      ("i2c: designware: Avoid unnecessary resuming during system suspend"),
      because for the non-ACPI case, the system sleep support was already broken
      prior to that point.
      
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c237efed
    • Ross Zwisler's avatar
      dax: fix deadlock due to misaligned PMD faults · 3a9495fd
      Ross Zwisler authored
      commit fffa281b upstream.
      
      In DAX there are two separate places where the 2MiB range of a PMD is
      defined.
      
      The first is in the page tables, where a PMD mapping inserted for a
      given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
      PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
      address to the 2MiB boundary above the address.
      
      So, for example, a fault at address 3MiB (0x30 0000) falls within the
      PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
      
      The second PMD range is in the mapping->page_tree, where a given file
      offset is covered by a radix tree entry that spans from one 2MiB aligned
      file offset to another 2MiB aligned file offset.
      
      So, for example, the file offset for 3MiB (pgoff 768) falls within the
      PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
      512) to 4MiB (pgoff 1024).
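
      The two range computations above can be checked with a few lines of
      arithmetic. This is a sketch that assumes 4KiB pages and 2MiB PMDs (as on
      x86-64); the TOY_* macros and helper names are hypothetical and merely
      mirror the kernel's PMD_MASK, PMD_SIZE and PG_PMD_COLOUR.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12                                  /* 4KiB pages */
#define TOY_PMD_SHIFT     21                                  /* 2MiB PMDs  */
#define TOY_PMD_SIZE      (UINT64_C(1) << TOY_PMD_SHIFT)
#define TOY_PMD_MASK      (~(TOY_PMD_SIZE - 1))
/* Pages per PMD minus one: 511 with the sizes assumed above. */
#define TOY_PG_PMD_COLOUR ((TOY_PMD_SIZE >> TOY_PAGE_SHIFT) - 1)

/* First byte covered by the page-table PMD that contains addr. */
static uint64_t toy_pmd_first_addr(uint64_t addr)
{
    return addr & TOY_PMD_MASK;
}

/* Last byte covered by the page-table PMD that contains addr. */
static uint64_t toy_pmd_last_addr(uint64_t addr)
{
    return (addr & TOY_PMD_MASK) + TOY_PMD_SIZE - 1;
}

/* First page offset covered by the order-9 entry that contains pgoff. */
static uint64_t toy_pmd_first_pgoff(uint64_t pgoff)
{
    return pgoff & ~TOY_PG_PMD_COLOUR;
}
```

      For the examples in the text, an address of 0x300000 yields the range
      0x200000 to 0x3fffff, and pgoff 768 yields an entry starting at pgoff
      512 and covering 512 pages.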
      
      This system works so long as the addresses and file offsets for a given
      mapping both have the same offsets relative to the start of each PMD.
      
      Consider the case where the starting address for a given file isn't 2MiB
      aligned - say our faulting address is 3 MiB (0x30 0000), but that
      corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
      the mapping are misaligned so that the 2MiB range defined in the page
      tables never matches up with the 2MiB range defined in the radix tree.
      
      The current code notices this case for DAX faults to storage with the
      following test in dax_pmd_insert_mapping():
      
      	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
      		goto unlock_fallback;
      
      This test makes sure that the pfn we get from the driver is 2MiB
      aligned, and relies on the assumption that the 2MiB alignment of the pfn
      we get back from the driver matches the 2MiB alignment of the faulting
      address.
      
      However, faults to holes were not checked and we could hit the problem
      described above.
      
      This was reported in response to the NVML nvml/src/test/pmempool_sync
      TEST5:
      
      	$ cd nvml/src/test/pmempool_sync
      	$ make TEST5
      
      You can grab NVML here:
      
      	https://github.com/pmem/nvml/
      
      The dmesg warning you see when you hit this error is:
      
        WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
      
      Where we notice in dax_insert_mapping_entry() that the radix tree entry
      we are about to replace doesn't match the locked entry that we had
      previously inserted into the tree.  This happens because the initial
      insertion was done in grab_mapping_entry() using a pgoff calculated from
      the faulting address (vmf->address), and the replacement in
      dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
      vmf->pgoff.
      
      In our failure case those two page offsets (one calculated from
      vmf->address, one using vmf->pgoff) point to different order 9 radix
      tree entries.
      
      This failure case can result in a deadlock because the radix tree unlock
      also happens on the pgoff calculated from vmf->address.  This means that
      the locked radix tree entry that we swapped in to the tree in
      dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
      future faults to that 2MiB range will block forever.
      
      Fix this by validating that the faulting address's PMD offset matches
      the PMD offset from the start of the file.  This check is done at the
      very beginning of the fault and covers faults that would have mapped to
      storage as well as faults to holes.  I left the COLOUR check in
      dax_pmd_insert_mapping() in place in case we ever hit the insanity
      condition where the alignment of the pfn we get from the driver doesn't
      match the alignment of the userspace address.
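
      One way to express the alignment validation described above is to
      compare the low nine bits of the page offset with the low nine bits of
      the faulting page number. The following is a userspace sketch assuming
      4KiB pages and 2MiB PMDs; the TOY_* macros and toy_pmd_fault_aligned()
      are hypothetical names, and the actual check lives in fs/dax.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12                       /* 4KiB pages assumed */
#define TOY_PMD_SHIFT     21                       /* 2MiB PMDs assumed  */
/* Mask selecting a page's index within its PMD: 511 here. */
#define TOY_PG_PMD_COLOUR \
    ((UINT64_C(1) << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT)) - 1)

/*
 * Return true when the faulting address and the file page offset sit at
 * the same position inside their respective 2MiB ranges, so the page-table
 * PMD and the order-9 radix tree entry describe the same pages.
 */
static bool toy_pmd_fault_aligned(uint64_t address, uint64_t pgoff)
{
    uint64_t addr_colour  = (address >> TOY_PAGE_SHIFT) & TOY_PG_PMD_COLOUR;
    uint64_t pgoff_colour = pgoff & TOY_PG_PMD_COLOUR;

    return addr_colour == pgoff_colour;
}
```

      In the failure case from the text (address 0x300000 mapping pgoff 0) the
      two colours are 256 and 0, so such a fault would fall back to smaller
      pages instead of taking the PMD path.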
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a9495fd
    • Kirill A. Shutemov's avatar
      mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled · 735a252f
      Kirill A. Shutemov authored
      commit 435c0b87 upstream.
      
      /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
      want to allocate huge pages when allocating pages for the private
      in-kernel shmem mount.
      
      Unfortunately, as Dan noticed, I've screwed it up and the only way to
      make the kernel allocate huge pages for the mount is to use "force" there.
      All other values will be effectively ignored.
      
      Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
      Fixes: 5a6e75f8 ("shmem: prepare huge= mount option and sysfs knob")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      735a252f
    • Chen Yu's avatar
      PM/hibernate: touch NMI watchdog when creating snapshot · b2719637
      Chen Yu authored
      commit 556b969a upstream.
      
      There is a problem that counting the pages for creating the
      hibernation snapshot takes a significant amount of time, especially on
      systems with large memory.  Since the counting job is performed with irqs
      disabled, this might lead to an NMI lockup.  The following warning was
      found on a system with 1.5TB DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It took nearly 20 seconds (2.10GHz CPU), thus the NMI lockup was
      triggered.  In case the timeout of the NMI watchdog has been set to 1
      second, a safe interval should be 6590003/20 = about 330k pages in theory.
      However there might also be some platforms running at a lower frequency,
      so feed the watchdog every 100k pages.
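
      The periodic feeding can be sketched in userspace with a countdown
      counter, which avoids a modulus in the loop. The interval of 128k pages
      is taken from the follow-up note in this commit message; every toy_*
      name is hypothetical, and toy_touch_watchdog() merely counts calls in
      place of the kernel's watchdog touch.

```c
#include <assert.h>

/* Interval between watchdog touches, in pages (128k, per the note). */
#define TOY_WD_PAGE_COUNT (128UL * 1024UL)

/* Counts how often the stand-in watchdog was fed. */
static unsigned long toy_watchdog_touches;

static void toy_touch_watchdog(void)
{
    toy_watchdog_touches++;
}

/*
 * Walk nr_pages page frames, feeding the watchdog once every
 * TOY_WD_PAGE_COUNT pages. Returns the number of pages visited.
 */
static unsigned long toy_mark_pages(unsigned long nr_pages)
{
    unsigned long page_count = TOY_WD_PAGE_COUNT;
    unsigned long visited = 0;

    for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
        /* Countdown instead of "pfn % interval": one decrement per page. */
        if (!--page_count) {
            toy_touch_watchdog();
            page_count = TOY_WD_PAGE_COUNT;
        }
        visited++;   /* stands in for the per-page bookkeeping */
    }
    return visited;
}
```

      For the 6590003 pages in the log above, this loop feeds the watchdog 50
      times, which stays well under the estimated safe interval.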
      
      [yu.c.chen@intel.com: simplification]
      Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2719637
    • Vineet Gupta's avatar
      ARCv2: PAE40: set MSB even if !CONFIG_ARC_HAS_PAE40 but PAE exists in SoC · 8b366972
      Vineet Gupta authored
      commit b5ddb6d5 upstream.
      
      The PAE40 configuration in hardware extends some of the address registers
      for TLB/cache ops to 2 words.
      
      So far the kernel was NOT setting the higher word if the feature was not
      enabled in software, which is wrong. Those need to be set to 0 in such a
      case.
      
      Normally this would be done in the cache flush / tlb ops, however since
      these registers only exist conditionally, this would have to be
      conditional on a flag being set at boot, which is expensive/ugly,
      especially for the more common case where PAE exists but is not in use.
      Optimize that by zeroing them once at boot; nobody will write to
      them afterwards.
      
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b366972
    • Alexey Brodkin's avatar
      ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses · fcedf2f2
      Alexey Brodkin authored
      commit 7d79cee2 upstream.
      
      It is necessary to explicitly set both SLC_AUX_RGN_START1 and
      SLC_AUX_RGN_END1, which hold the MSB bits of the physical address of the
      region start and end respectively; otherwise the SLC region operation is
      executed in an unpredictable manner.
      
      Without this patch, SLC flushes on HSDK (IOC disabled) were taking
      seconds.
      
      Reported-by: Vladimir Kondratiev <vladimir.kondratiev@intel.com>
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      [vgupta: PAR40 regs only written if PAE40 exist]
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcedf2f2
    • Alexey Brodkin's avatar
      ARCv2: SLC: Make sure busy bit is set properly for region ops · 763ad317
      Alexey Brodkin authored
      commit b37174d9 upstream.
      
      c70c4733 "ARCv2: SLC: Make sure busy bit is set properly on SLC flushing"
      fixes the problem for the entire-SLC operation, where the problem was
      initially caught. But given the nature of the issue, it is perfectly
      possible for the busy bit to be read incorrectly even when a region
      operation was started.
      
      So extend the initial fix to region operations as well.
      
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      763ad317