1. 03 May, 2017 15 commits
  2. 30 Apr, 2017 18 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.4.65 · 418b9904
      Greg Kroah-Hartman authored
      418b9904
    • Peter Zijlstra's avatar
      perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race · 416bd4a3
      Peter Zijlstra authored
      commit 321027c1 upstream.
      
      Di Shen reported a race between two concurrent sys_perf_event_open()
      calls where both try and move the same pre-existing software group
      into a hardware context.
      
      The problem is exactly that described in commit:
      
        f63a8daa ("perf: Fix event->ctx locking")
      
      ... where, while we wait for a ctx->mutex acquisition, the event->ctx
      relation can have changed under us.
      
      That very same commit failed to recognise sys_perf_event_context() as an
      external access vector to the events and thereby didn't apply the
      established locking rules correctly.
      
      So while one sys_perf_event_open() call is stuck waiting on
      mutex_lock_double(), the other (which owns said locks) moves the group
      about. So by the time the former sys_perf_event_open() acquires the
      locks, the context we've acquired is stale (and possibly dead).
      
      Apply the established locking rules as per perf_event_ctx_lock_nested()
      to the mutex_lock_double() for the 'move_group' case. This obviously means
      we need to validate state after we acquire the locks.
      
      Reported-by: Di Shen (Keen Lab)
      Tested-by: default avatarJohn Dias <joaodias@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Min Chong <mchong@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: f63a8daa ("perf: Fix event->ctx locking")
      Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [bwh: Backported to 4.4:
       - Test perf_event::group_flags instead of group_caps
       - Adjust context]
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      416bd4a3
    • Eric Dumazet's avatar
      ping: implement proper locking · b7f47c79
      Eric Dumazet authored
      commit 43a66845 upstream.
      
      We got a report of yet another bug in ping
      
      http://www.openwall.com/lists/oss-security/2017/03/24/6
      
      ->disconnect() is not called with socket lock held.
      
      Fix this by acquiring ping rwlock earlier.
      
      Thanks to Daniel, Alexander and Andrey for letting us know this problem.
      
      Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDaniel Jiang <danieljiang0415@gmail.com>
      Reported-by: default avatarSolar Designer <solar@openwall.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7f47c79
    • EunTaik Lee's avatar
      staging/android/ion : fix a race condition in the ion driver · a7544fdd
      EunTaik Lee authored
      commit 9590232b upstream.
      
      There is a use-after-free problem in the ion driver.
      This is caused by a race condition in the ion_ioctl()
      function.
      
      A handle has ref count of 1 and two tasks on different
      cpus calls ION_IOC_FREE simultaneously.
      
      cpu 0                                   cpu 1
      -------------------------------------------------------
      ion_handle_get_by_id()
      (ref == 2)
                                  ion_handle_get_by_id()
                                  (ref == 3)
      
      ion_free()
      (ref == 2)
      
      ion_handle_put()
      (ref == 1)
      
                                  ion_free()
                                  (ref == 0 so ion_handle_destroy() is
                                  called
                                  and the handle is freed.)
      
                                  ion_handle_put() is called and it
                                  decreases the slub's next free pointer
      
      The problem is detected as an unaligned access in the
      spin lock functions since it uses load exclusive
       instruction. In some cases it corrupts the slub's
      free pointer which causes a mis-aligned access to the
      next free pointer.(kmalloc returns a pointer like
      ffffc0745b4580aa). And it causes lots of other
      hard-to-debug problems.
      
      This symptom is caused since the first member in the
      ion_handle structure is the reference count and the
      ion driver decrements the reference after it has been
      freed.
      
      To fix this problem client->lock mutex is extended
      to protect all the codes that uses the handle.
      Signed-off-by: default avatarEun Taik Lee <eun.taik.lee@samsung.com>
      Reviewed-by: default avatarLaura Abbott <labbott@redhat.com>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      index 7ff2a7ec871f..33b390e7ea31
      a7544fdd
    • Vlad Tsyrklevich's avatar
      vfio/pci: Fix integer overflows, bitmask check · d23ef85b
      Vlad Tsyrklevich authored
      commit 05692d70 upstream.
      
      The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
      user-supplied integers, potentially allowing memory corruption. This
      patch adds appropriate integer overflow checks, checks the range bounds
      for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
      in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
      VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
      vfio_pci_set_irqs_ioctl().
      
      Furthermore, a kzalloc is changed to a kcalloc because the use of a
      kzalloc with an integer multiplication allowed an integer overflow
      condition to be reached without this patch. kcalloc checks for overflow
      and should prevent a similar occurrence.
      Signed-off-by: default avatarVlad Tsyrklevich <vlad@tsyrklevich.net>
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d23ef85b
    • Michal Kubeček's avatar
      tipc: check minimum bearer MTU · 65d30f75
      Michal Kubeček authored
      commit 3de81b75 upstream.
      
      Qian Zhang (张谦) reported a potential socket buffer overflow in
      tipc_msg_build() which is also known as CVE-2016-8632: due to
      insufficient checks, a buffer overflow can occur if MTU is too short for
      even tipc headers. As anyone can set device MTU in a user/net namespace,
      this issue can be abused by a regular user.
      
      As agreed in the discussion on Ben Hutchings' original patch, we should
      check the MTU at the moment a bearer is attached rather than for each
      processed packet. We also need to repeat the check when bearer MTU is
      adjusted to new device MTU. UDP case also needs a check to avoid
      overflow when calculating bearer MTU.
      
      Fixes: b97bf3fd ("[TIPC] Initial merge")
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Reported-by: default avatarQian Zhang (张谦) <zhangqian-c@360.cn>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 4.4:
       - Adjust context
       - NETDEV_GOING_DOWN and NETDEV_CHANGEMTU cases in net notifier were combined]
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      65d30f75
    • Phil Turnbull's avatar
      netfilter: nfnetlink: correctly validate length of batch messages · 9540baad
      Phil Turnbull authored
      commit c58d6c93 upstream.
      
      If nlh->nlmsg_len is zero then an infinite loop is triggered because
      'skb_pull(skb, msglen);' pulls zero bytes.
      
      The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len <
      NLMSG_HDRLEN' which bypasses the length validation and will later
      trigger an out-of-bound read.
      
      If the length validation does fail then the malformed batch message is
      copied back to userspace. However, we cannot do this because the
      nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in
      netlink_ack:
      
          [   41.455421] ==================================================================
          [   41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340
          [   41.456431] Read of size 4294967280 by task a.out/987
          [   41.456431] =============================================================================
          [   41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected
          [   41.456431] -----------------------------------------------------------------------------
          ...
          [   41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00  ................
          [   41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00   ...............
          [   41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05  .......@EV."3...
          [   41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb  ................
                                                  ^^ start of batch nlmsg with
                                                     nlmsg_len=4294967280
          ...
          [   41.456431] Memory state around the buggy address:
          [   41.456431]  ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          [   41.456431]  ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          [   41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
          [   41.456431]                                ^
          [   41.456431]  ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          [   41.456431]  ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
          [   41.456431] ==================================================================
      
      Fix this with better validation of nlh->nlmsg_len and by setting
      NFNL_BATCH_FAILURE if any batch message fails length validation.
      
      CAP_NET_ADMIN is required to trigger the bugs.
      
      Fixes: 9ea2aa8b ("netfilter: nfnetlink: validate nfnetlink header from batch")
      Signed-off-by: default avatarPhil Turnbull <phil.turnbull@oracle.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9540baad
    • Mauro Carvalho Chehab's avatar
      xc2028: avoid use after free · 0d9dac5d
      Mauro Carvalho Chehab authored
      commit 8dfbcc43 upstream.
      
      If struct xc2028_config is passed without a firmware name,
      the following trouble may happen:
      
      [11009.907205] xc2028 5-0061: type set to XCeive xc2028/xc3028 tuner
      [11009.907491] ==================================================================
      [11009.907750] BUG: KASAN: use-after-free in strcmp+0x96/0xb0 at addr ffff8803bd78ab40
      [11009.907992] Read of size 1 by task modprobe/28992
      [11009.907994] =============================================================================
      [11009.907997] BUG kmalloc-16 (Tainted: G        W      ): kasan: bad access detected
      [11009.907999] -----------------------------------------------------------------------------
      
      [11009.908008] INFO: Allocated in xhci_urb_enqueue+0x214/0x14c0 [xhci_hcd] age=0 cpu=3 pid=28992
      [11009.908012] 	___slab_alloc+0x581/0x5b0
      [11009.908014] 	__slab_alloc+0x51/0x90
      [11009.908017] 	__kmalloc+0x27b/0x350
      [11009.908022] 	xhci_urb_enqueue+0x214/0x14c0 [xhci_hcd]
      [11009.908026] 	usb_hcd_submit_urb+0x1e8/0x1c60
      [11009.908029] 	usb_submit_urb+0xb0e/0x1200
      [11009.908032] 	usb_serial_generic_write_start+0xb6/0x4c0
      [11009.908035] 	usb_serial_generic_write+0x92/0xc0
      [11009.908039] 	usb_console_write+0x38a/0x560
      [11009.908045] 	call_console_drivers.constprop.14+0x1ee/0x2c0
      [11009.908051] 	console_unlock+0x40d/0x900
      [11009.908056] 	vprintk_emit+0x4b4/0x830
      [11009.908061] 	vprintk_default+0x1f/0x30
      [11009.908064] 	printk+0x99/0xb5
      [11009.908067] 	kasan_report_error+0x10a/0x550
      [11009.908070] 	__asan_report_load1_noabort+0x43/0x50
      [11009.908074] INFO: Freed in xc2028_set_config+0x90/0x630 [tuner_xc2028] age=1 cpu=3 pid=28992
      [11009.908077] 	__slab_free+0x2ec/0x460
      [11009.908080] 	kfree+0x266/0x280
      [11009.908083] 	xc2028_set_config+0x90/0x630 [tuner_xc2028]
      [11009.908086] 	xc2028_attach+0x310/0x8a0 [tuner_xc2028]
      [11009.908090] 	em28xx_attach_xc3028.constprop.7+0x1f9/0x30d [em28xx_dvb]
      [11009.908094] 	em28xx_dvb_init.part.3+0x8e4/0x5cf4 [em28xx_dvb]
      [11009.908098] 	em28xx_dvb_init+0x81/0x8a [em28xx_dvb]
      [11009.908101] 	em28xx_register_extension+0xd9/0x190 [em28xx]
      [11009.908105] 	em28xx_dvb_register+0x10/0x1000 [em28xx_dvb]
      [11009.908108] 	do_one_initcall+0x141/0x300
      [11009.908111] 	do_init_module+0x1d0/0x5ad
      [11009.908114] 	load_module+0x6666/0x9ba0
      [11009.908117] 	SyS_finit_module+0x108/0x130
      [11009.908120] 	entry_SYSCALL_64_fastpath+0x16/0x76
      [11009.908123] INFO: Slab 0xffffea000ef5e280 objects=25 used=25 fp=0x          (null) flags=0x2ffff8000004080
      [11009.908126] INFO: Object 0xffff8803bd78ab40 @offset=2880 fp=0x0000000000000001
      
      [11009.908130] Bytes b4 ffff8803bd78ab30: 01 00 00 00 2a 07 00 00 9d 28 00 00 01 00 00 00  ....*....(......
      [11009.908133] Object ffff8803bd78ab40: 01 00 00 00 00 00 00 00 b0 1d c3 6a 00 88 ff ff  ...........j....
      [11009.908137] CPU: 3 PID: 28992 Comm: modprobe Tainted: G    B   W       4.5.0-rc1+ #43
      [11009.908140] Hardware name:                  /NUC5i7RYB, BIOS RYBDWi35.86A.0350.2015.0812.1722 08/12/2015
      [11009.908142]  ffff8803bd78a000 ffff8802c273f1b8 ffffffff81932007 ffff8803c6407a80
      [11009.908148]  ffff8802c273f1e8 ffffffff81556759 ffff8803c6407a80 ffffea000ef5e280
      [11009.908153]  ffff8803bd78ab40 dffffc0000000000 ffff8802c273f210 ffffffff8155ccb4
      [11009.908158] Call Trace:
      [11009.908162]  [<ffffffff81932007>] dump_stack+0x4b/0x64
      [11009.908165]  [<ffffffff81556759>] print_trailer+0xf9/0x150
      [11009.908168]  [<ffffffff8155ccb4>] object_err+0x34/0x40
      [11009.908171]  [<ffffffff8155f260>] kasan_report_error+0x230/0x550
      [11009.908175]  [<ffffffff81237d71>] ? trace_hardirqs_off_caller+0x21/0x290
      [11009.908179]  [<ffffffff8155e926>] ? kasan_unpoison_shadow+0x36/0x50
      [11009.908182]  [<ffffffff8155f5c3>] __asan_report_load1_noabort+0x43/0x50
      [11009.908185]  [<ffffffff8155ea00>] ? __asan_register_globals+0x50/0xa0
      [11009.908189]  [<ffffffff8194cea6>] ? strcmp+0x96/0xb0
      [11009.908192]  [<ffffffff8194cea6>] strcmp+0x96/0xb0
      [11009.908196]  [<ffffffffa13ba4ac>] xc2028_set_config+0x15c/0x630 [tuner_xc2028]
      [11009.908200]  [<ffffffffa13bac90>] xc2028_attach+0x310/0x8a0 [tuner_xc2028]
      [11009.908203]  [<ffffffff8155ea78>] ? memset+0x28/0x30
      [11009.908206]  [<ffffffffa13ba980>] ? xc2028_set_config+0x630/0x630 [tuner_xc2028]
      [11009.908211]  [<ffffffffa157a59a>] em28xx_attach_xc3028.constprop.7+0x1f9/0x30d [em28xx_dvb]
      [11009.908215]  [<ffffffffa157aa2a>] ? em28xx_dvb_init.part.3+0x37c/0x5cf4 [em28xx_dvb]
      [11009.908219]  [<ffffffffa157a3a1>] ? hauppauge_hvr930c_init+0x487/0x487 [em28xx_dvb]
      [11009.908222]  [<ffffffffa01795ac>] ? lgdt330x_attach+0x1cc/0x370 [lgdt330x]
      [11009.908226]  [<ffffffffa01793e0>] ? i2c_read_demod_bytes.isra.2+0x210/0x210 [lgdt330x]
      [11009.908230]  [<ffffffff812e87d0>] ? ref_module.part.15+0x10/0x10
      [11009.908233]  [<ffffffff812e56e0>] ? module_assert_mutex_or_preempt+0x80/0x80
      [11009.908238]  [<ffffffffa157af92>] em28xx_dvb_init.part.3+0x8e4/0x5cf4 [em28xx_dvb]
      [11009.908242]  [<ffffffffa157a6ae>] ? em28xx_attach_xc3028.constprop.7+0x30d/0x30d [em28xx_dvb]
      [11009.908245]  [<ffffffff8195222d>] ? string+0x14d/0x1f0
      [11009.908249]  [<ffffffff8195381f>] ? symbol_string+0xff/0x1a0
      [11009.908253]  [<ffffffff81953720>] ? uuid_string+0x6f0/0x6f0
      [11009.908257]  [<ffffffff811a775e>] ? __kernel_text_address+0x7e/0xa0
      [11009.908260]  [<ffffffff8104b02f>] ? print_context_stack+0x7f/0xf0
      [11009.908264]  [<ffffffff812e9846>] ? __module_address+0xb6/0x360
      [11009.908268]  [<ffffffff8137fdc9>] ? is_ftrace_trampoline+0x99/0xe0
      [11009.908271]  [<ffffffff811a775e>] ? __kernel_text_address+0x7e/0xa0
      [11009.908275]  [<ffffffff81240a70>] ? debug_check_no_locks_freed+0x290/0x290
      [11009.908278]  [<ffffffff8104a24b>] ? dump_trace+0x11b/0x300
      [11009.908282]  [<ffffffffa13e8143>] ? em28xx_register_extension+0x23/0x190 [em28xx]
      [11009.908285]  [<ffffffff81237d71>] ? trace_hardirqs_off_caller+0x21/0x290
      [11009.908289]  [<ffffffff8123ff56>] ? trace_hardirqs_on_caller+0x16/0x590
      [11009.908292]  [<ffffffff812404dd>] ? trace_hardirqs_on+0xd/0x10
      [11009.908296]  [<ffffffffa13e8143>] ? em28xx_register_extension+0x23/0x190 [em28xx]
      [11009.908299]  [<ffffffff822dcbb0>] ? mutex_trylock+0x400/0x400
      [11009.908302]  [<ffffffff810021a1>] ? do_one_initcall+0x131/0x300
      [11009.908306]  [<ffffffff81296dc7>] ? call_rcu_sched+0x17/0x20
      [11009.908309]  [<ffffffff8159e708>] ? put_object+0x48/0x70
      [11009.908314]  [<ffffffffa1579f11>] em28xx_dvb_init+0x81/0x8a [em28xx_dvb]
      [11009.908317]  [<ffffffffa13e81f9>] em28xx_register_extension+0xd9/0x190 [em28xx]
      [11009.908320]  [<ffffffffa0150000>] ? 0xffffffffa0150000
      [11009.908324]  [<ffffffffa0150010>] em28xx_dvb_register+0x10/0x1000 [em28xx_dvb]
      [11009.908327]  [<ffffffff810021b1>] do_one_initcall+0x141/0x300
      [11009.908330]  [<ffffffff81002070>] ? try_to_run_init_process+0x40/0x40
      [11009.908333]  [<ffffffff8123ff56>] ? trace_hardirqs_on_caller+0x16/0x590
      [11009.908337]  [<ffffffff8155e926>] ? kasan_unpoison_shadow+0x36/0x50
      [11009.908340]  [<ffffffff8155e926>] ? kasan_unpoison_shadow+0x36/0x50
      [11009.908343]  [<ffffffff8155e926>] ? kasan_unpoison_shadow+0x36/0x50
      [11009.908346]  [<ffffffff8155ea37>] ? __asan_register_globals+0x87/0xa0
      [11009.908350]  [<ffffffff8144da7b>] do_init_module+0x1d0/0x5ad
      [11009.908353]  [<ffffffff812f2626>] load_module+0x6666/0x9ba0
      [11009.908356]  [<ffffffff812e9c90>] ? symbol_put_addr+0x50/0x50
      [11009.908361]  [<ffffffffa1580037>] ? em28xx_dvb_init.part.3+0x5989/0x5cf4 [em28xx_dvb]
      [11009.908366]  [<ffffffff812ebfc0>] ? module_frob_arch_sections+0x20/0x20
      [11009.908369]  [<ffffffff815bc940>] ? open_exec+0x50/0x50
      [11009.908374]  [<ffffffff811671bb>] ? ns_capable+0x5b/0xd0
      [11009.908377]  [<ffffffff812f5e58>] SyS_finit_module+0x108/0x130
      [11009.908379]  [<ffffffff812f5d50>] ? SyS_init_module+0x1f0/0x1f0
      [11009.908383]  [<ffffffff81004044>] ? lockdep_sys_exit_thunk+0x12/0x14
      [11009.908394]  [<ffffffff822e6936>] entry_SYSCALL_64_fastpath+0x16/0x76
      [11009.908396] Memory state around the buggy address:
      [11009.908398]  ffff8803bd78aa00: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [11009.908401]  ffff8803bd78aa80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [11009.908403] >ffff8803bd78ab00: fc fc fc fc fc fc fc fc 00 00 fc fc fc fc fc fc
      [11009.908405]                                            ^
      [11009.908407]  ffff8803bd78ab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [11009.908409]  ffff8803bd78ac00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [11009.908411] ==================================================================
      
      In order to avoid it, let's set the cached value of the firmware
      name to NULL after freeing it. While here, return an error if
      the memory allocation fails.
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d9dac5d
    • Eric W. Biederman's avatar
      mnt: Add a per mount namespace limit on the number of mounts · c50fd34e
      Eric W. Biederman authored
      commit d2921684 upstream.
      
      CAI Qian <caiqian@redhat.com> pointed out that the semantics
      of shared subtrees make it possible to create an exponentially
      increasing number of mounts in a mount namespace.
      
          mkdir /tmp/1 /tmp/2
          mount --make-rshared /
          for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done
      
      Will create create 2^20 or 1048576 mounts, which is a practical problem
      as some people have managed to hit this by accident.
      
      As such CVE-2016-6213 was assigned.
      
      Ian Kent <raven@themaw.net> described the situation for autofs users
      as follows:
      
      > The number of mounts for direct mount maps is usually not very large because of
      > the way they are implemented, large direct mount maps can have performance
      > problems. There can be anywhere from a few (likely case a few hundred) to less
      > than 10000, plus mounts that have been triggered and not yet expired.
      >
      > Indirect mounts have one autofs mount at the root plus the number of mounts that
      > have been triggered and not yet expired.
      >
      > The number of autofs indirect map entries can range from a few to the common
      > case of several thousand and in rare cases up to between 30000 and 50000. I've
      > not heard of people with maps larger than 50000 entries.
      >
      > The larger the number of map entries the greater the possibility for a large
      > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
      > more active mounts.
      
      So I am setting the default number of mounts allowed per mount
      namespace at 100,000.  This is more than enough for any use case I
      know of, but small enough to quickly stop an exponential increase
      in mounts.  Which should be perfect to catch misconfigurations and
      malfunctioning programs.
      
      For anyone who needs a higher limit this can be changed by writing
      to the new /proc/sys/fs/mount-max sysctl.
      Tested-by: default avatarCAI Qian <caiqian@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      [bwh: Backported to 4.4: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c50fd34e
    • Jon Paul Maloy's avatar
      tipc: fix socket timer deadlock · 59e0cd11
      Jon Paul Maloy authored
      commit f1d048f2 upstream.
      
      We sometimes observe a 'deadly embrace' type deadlock occurring
      between mutually connected sockets on the same node. This happens
      when the one-hour peer supervision timers happen to expire
      simultaneously in both sockets.
      
      The scenario is as follows:
      
      CPU 1:                          CPU 2:
      --------                        --------
      tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
        lock(sk1.slock)                 lock(sk2.slock)
        msg_create(probe)               msg_create(probe)
        unlock(sk1.slock)               unlock(sk2.slock)
        tipc_node_xmit_skb()            tipc_node_xmit_skb()
          tipc_node_xmit()                tipc_node_xmit()
            tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
              lock(sk2.slock)                 lock((sk1.slock)
              filter_rcv()                    filter_rcv()
                tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
                  msg_create(probe_rsp)           msg_create(probe_rsp)
                  tipc_sk_respond()               tipc_sk_respond()
                    tipc_node_xmit_skb()            tipc_node_xmit_skb()
                      tipc_node_xmit()                tipc_node_xmit()
                        tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                          lock((sk1.slock)                lock((sk2.slock)
                          ===> DEADLOCK                   ===> DEADLOCK
      
      Further analysis reveals that there are three different locations in the
      socket code where tipc_sk_respond() is called within the context of the
      socket lock, with ensuing risk of similar deadlocks.
      
      We now solve this by passing a buffer queue along with all upcalls where
      sk_lock.slock may potentially be held. Response or rejected message
      buffers are accumulated into this queue instead of being sent out
      directly, and only sent once we know we are safely outside the slock
      context.
      Reported-by: default avatarGUNA <gbalasun@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      59e0cd11
    • Parthasarathy Bhuvaragan's avatar
      tipc: fix random link resets while adding a second bearer · abc025d1
      Parthasarathy Bhuvaragan authored
      commit d2f394dc upstream.
      
      In a dual bearer configuration, if the second tipc link becomes
      active while the first link still has pending nametable "bulk"
      updates, it randomly leads to reset of the second link.
      
      When a link is established, the function named_distribute(),
      fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
      with NAME_DISTRIBUTOR message for each PUBLICATION.
      However, the function named_distribute() allocates the buffer by
      increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
      This consumes the space allocated for TUNNEL_PROTOCOL.
      
      When establishing the second link, the link shall tunnel all the
      messages in the first link queue including the "bulk" update.
      As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
      the link mtu the transmission fails (-EMSGSIZE).
      
      Thus, the synch point based on the message count of the tunnel
      packets is never reached leading to link timeout.
      
      In this commit, we adjust the size of name distributor message so that
      they can be tunnelled.
      Reviewed-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      abc025d1
    • Arnd Bergmann's avatar
      gfs2: avoid uninitialized variable warning · d39cb4a5
      Arnd Bergmann authored
      commit 67893f12 upstream.
      
      We get a bogus warning about a potential uninitialized variable
      use in gfs2, because the compiler does not figure out that we
      never use the leaf number if get_leaf_nr() returns an error:
      
      fs/gfs2/dir.c: In function 'get_first_leaf':
      fs/gfs2/dir.c:802:9: warning: 'leaf_no' may be used uninitialized in this function [-Wmaybe-uninitialized]
      fs/gfs2/dir.c: In function 'dir_split_leaf':
      fs/gfs2/dir.c:1021:8: warning: 'leaf_no' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Changing the 'if (!error)' to 'if (!IS_ERR_VALUE(error))' is
      sufficient to let gcc understand that this is exactly the same
      condition as in IS_ERR() so it can optimize the code path enough
      to understand it.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d39cb4a5
    • Arnd Bergmann's avatar
      hostap: avoid uninitialized variable use in hfa384x_get_rid · 9a35bc2a
      Arnd Bergmann authored
      commit 48dc5fb3 upstream.
      
      The driver reads a value from hfa384x_from_bap(), which may fail,
      and then assigns the value to a local variable. gcc detects that
      in in the failure case, the 'rlen' variable now contains
      uninitialized data:
      
      In file included from ../drivers/net/wireless/intersil/hostap/hostap_pci.c:220:0:
      drivers/net/wireless/intersil/hostap/hostap_hw.c: In function 'hfa384x_get_rid':
      drivers/net/wireless/intersil/hostap/hostap_hw.c:842:5: warning: 'rec' may be used uninitialized in this function [-Wmaybe-uninitialized]
        if (le16_to_cpu(rec.len) == 0) {
      
      This restructures the function as suggested by Russell King, to
      make it more readable and get more reliable error handling, by
      handling each failure mode using a goto.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a35bc2a
    • Arnd Bergmann's avatar
      tty: nozomi: avoid a harmless gcc warning · 58f80ccf
      Arnd Bergmann authored
      commit a4f642a8 upstream.
      
      The nozomi wireless data driver has its own helper function to
      transfer data from a FIFO, doing an extra byte swap on big-endian
      architectures, presumably to bring the data back into byte-serial
      order after readw() or readl() perform their implicit byteswap.
      
      This helper function is used in the receive_data() function to
      first read the length into a 32-bit variable, which causes
      a compile-time warning:
      
      drivers/tty/nozomi.c: In function 'receive_data':
      drivers/tty/nozomi.c:857:9: warning: 'size' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      The problem is that gcc is unsure whether the data was actually
      read or not. We know that it is at this point, so we can replace
      it with a single readl() to shut up that warning.
      
      I am leaving the byteswap in there, to preserve the existing
      behavior, even though this seems fishy: Reading the length of
      the data into a cpu-endian variable should normally not use
      a second byteswap on big-endian systems, unless the hardware
      is aware of the CPU endianess.
      
      There appears to be a lot more confusion about endianess in this
      driver, so it probably has not worked on big-endian systems in
      a long time, if ever, and I have no way to test it. It's well
      possible that this driver has not been used by anyone in a while,
      the last patch that looks like it was tested on the hardware is
      from 2008.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58f80ccf
    • Jon Paul Maloy's avatar
      tipc: correct error in node fsm · 2847736f
      Jon Paul Maloy authored
      commit c4282ca7 upstream.
      
      commit 88e8ac70 ("tipc: reduce transmission rate of reset messages
      when link is down") revealed a flaw in the node FSM, as defined in
      the log of commit 66996b6c ("tipc: extend node FSM").
      
      We see the following scenario:
      1: Node B receives a RESET message from node A before its link endpoint
         is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
         event will not change the node FSM state, but the (distinct) link FSM
         will move to state RESETTING.
      2: As an effect of the previous event, the local endpoint on B will
         declare node A lost, and post the event SELF_DOWN to the its node
         FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
         that no messages will be accepted from A until it receives another
         RESET message that confirms that A's endpoint has been reset. This
         is  wasteful, since we know this as a fact already from the first
         received RESET, but worse is that the link instance's FSM has not
         wasted this information, but instead moved on to state ESTABLISHING,
         meaning that it repeatedly sends out ACTIVATE messages to the reset
         peer A.
      3: Node A will receive one of the ACTIVATE messages, move its link FSM
         to state ESTABLISHED, and start repeatedly sending out STATE messages
         to node B.
      4: Node B will consistently drop these messages, since it can only accept
         accept a RESET according to its node FSM.
      5: After four lost STATE messages node A will reset its link and start
         repeatedly sending out RESET messages to B.
      6: Because of the reduced send rate for RESET messages, it is very
         likely that A will receive an ACTIVATE (which is sent out at a much
         higher frequency) before it gets the chance to send a RESET, and A
         may hence quickly move back to state ESTABLISHED and continue sending
         out STATE messages, which will again be dropped by B.
      7: GOTO 5.
      8: After having repeated the cycle 5-7 a number of times, node A will
         by chance get in between with sending a RESET, and the situation is
         resolved.
      
      Unfortunately, we have seen that it may take a substantial amount of
      time before this vicious loop is broken, sometimes in the order of
      minutes.
      
      We correct this by making a small correction to the node FSM: When a
      node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
      moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
      SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
      need to wait for RESET confirmation from of an endpoint that we alread
      know has been reset. It also means that node B in the scenario above
      will not be dropping incoming STATE messages, and the link can come up
      immediately.
      
      Finally, a symmetry comparison reveals that the  FSM has a similar
      error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
      Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
      to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
      of this logical error, we choose fix this one, too.
      
      The node FSM looks as follows after those changes:
      
                                 +----------------------------------------+
                                 |                           PEER_DOWN_EVT|
                                 |                                        |
        +------------------------+----------------+                       |
        |SELF_DOWN_EVT           |                |                       |
        |                        |                |                       |
        |              +-----------+          +-----------+               |
        |              |NODE_      |          |NODE_      |               |
        |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
        |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
        |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
        |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |         +--------------+        |           |   |
        |   |           +-------->|   SELF_UP_   |<-------+           |   |
        |   |   +-----------------|   PEER_UP    |----------------+   |   |
        |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
        |   |   |                    A        A                   |   |   |
        |   |   |                    |        |                   |   |   |
        |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
        |   |   |                    |        |                   |   |   |
        V   V   V                    |        |                   V   V   V
      +------------+       +-----------+    +-----------+       +------------+
      |SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
      |PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
      +------------+       +-----------+    +-----------+       +------------+
             |               |       A        A       |                |
             |               |       |        |       |                |
             |       SELF_   |       |SELF_   |PEER_  |PEER_           |
             |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
             |               |       |        |       |                |
             |               |       |        |       |                |
             |               |    +--------------+    |                |
             |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
             +------------------->|  PEER_DOWN   |<--------------------+
                                  +--------------+
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2847736f
    • Jon Paul Maloy's avatar
      tipc: re-enable compensation for socket receive buffer double counting · 76ca3053
      Jon Paul Maloy authored
      commit 7c8bcfb1 upstream.
      
      In the refactoring commit d570d864 ("tipc: enqueue arrived buffers
      in socket in separate function") we did by accident replace the test
      
      if (sk->sk_backlog.len == 0)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      with
      
      if (sk->sk_backlog.len)
           atomic_set(&tsk->dupl_rcvcnt, 0);
      
      This effectively disables the compensation we have for the double
      receive buffer accounting that occurs temporarily when buffers are
      moved from the backlog to the socket receive queue. Until now, this
      has gone unnoticed because of the large receive buffer limits we are
      applying, but becomes indispensable when we reduce this buffer limit
      later in this series.
      
      We now fix this by inverting the mentioned condition.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      76ca3053
    • Erik Hugne's avatar
      tipc: make dist queue pernet · 3f315590
      Erik Hugne authored
      commit 541726ab upstream.
      
      Nametable updates received from the network that cannot be applied
      immediately are placed on a defer queue. This queue is global to the
      TIPC module, which might cause problems when using TIPC in containers.
      To prevent nametable updates from escaping into the wrong namespace,
      we make the queue pernet instead.
      Signed-off-by: default avatarErik Hugne <erik.hugne@gmail.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3f315590
    • Richard Alpe's avatar
      tipc: make sure IPv6 header fits in skb headroom · 44b3b7e0
      Richard Alpe authored
      commit 9bd160bf upstream.
      
      Expand headroom further in order to be able to fit the larger IPv6
      header. Prior to this patch this caused a skb under panic for certain
      tipc packets when using IPv6 UDP bearer(s).
      Signed-off-by: default avatarRichard Alpe <richard.alpe@ericsson.com>
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44b3b7e0
  3. 27 Apr, 2017 7 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.4.64 · 12f4e1f5
      Greg Kroah-Hartman authored
      12f4e1f5
    • Jon Paul Maloy's avatar
      tipc: fix crash during node removal · 6862fa90
      Jon Paul Maloy authored
      commit d25a0125 upstream.
      
      When the TIPC module is unloaded, we have identified a race condition
      that allows a node reference counter to go to zero and the node instance
      being freed before the node timer is finished with accessing it. This
      leads to occasional crashes, especially in multi-namespace environments.
      
      The scenario goes as follows:
      
      CPU0:(node_stop)                       CPU1:(node_timeout)  // ref == 2
      
      1:                                          if(!mod_timer())
      2: if (del_timer())
      3:   tipc_node_put()                                        // ref -> 1
      4: tipc_node_put()                                          // ref -> 0
      5:   kfree_rcu(node);
      6:                                               tipc_node_get(node)
      7:                                               // BOOM!
      
      We now clean up this functionality as follows:
      
      1) We remove the node pointer from the node lookup table before we
         attempt deactivating the timer. This way, we reduce the risk that
         tipc_node_find() may obtain a valid pointer to an instance marked
         for deletion; a harmless but undesirable situation.
      
      2) We use del_timer_sync() instead of del_timer() to safely deactivate
         the node timer without any risk that it might be reactivated by the
         timeout handler. There is no risk of deadlock here, since the two
         functions never touch the same spinlocks.
      
      3: We remove a pointless tipc_node_get() + tipc_node_put() from the
         timeout handler.
      Reported-by: default avatarZhijiang Hu <huzhijiang@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6862fa90
    • Dan Williams's avatar
      block: fix del_gendisk() vs blkdev_ioctl crash · 6ddbac9a
      Dan Williams authored
      commit ac34f15e upstream.
      
      When tearing down a block device early in its lifetime, userspace may
      still be performing discovery actions like blkdev_ioctl() to re-read
      partitions.
      
      The nvdimm_revalidate_disk() implementation depends on
      disk->driverfs_dev to be valid at entry.  However, it is set to NULL in
      del_gendisk() and fatally this is happening *before* the disk device is
      deleted from userspace view.
      
      There's no reason for del_gendisk() to clear ->driverfs_dev.  That
      device is the parent of the disk.  It is guaranteed to not be freed
      until the disk, as a child, drops its ->parent reference.
      
      We could also fix this issue locally in nvdimm_revalidate_disk() by
      using disk_to_dev(disk)->parent, but lets fix it globally since
      ->driverfs_dev follows the lifetime of the parent.  Longer term we
      should probably just add a @parent parameter to add_disk(), and stop
      carrying this pointer in the gendisk.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
       CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
       [..]
       Call Trace:
        [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
        [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
        [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
        [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
        [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
        [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
        [<ffffffff812916dd>] block_ioctl+0x3d/0x50
        [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
        [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
        [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff81264f09>] SyS_ioctl+0x79/0x90
        [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Reported-by: default avatarRobert Hu <robert.hu@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ddbac9a
    • Dan Williams's avatar
      x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions · d1cc3cdd
      Dan Williams authored
      commit 11e63f6d upstream.
      
      Before we rework the "pmem api" to stop abusing __copy_user_nocache()
      for memcpy_to_pmem() we need to fix cases where we may strand dirty data
      in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
      for arbitrary data transfers from userspace. There is no guarantee that
      these transfers, performed by dax_iomap_actor(), will have aligned
      destinations or aligned transfer lengths. Backstop the usage
      __copy_user_nocache() with explicit cache management in these unaligned
      cases.
      
      Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
      that is saved for a later patch that moves the entirety of the "pmem
      api" into the pmem driver directly.
      
      Fixes: 5de490da ("pmem: add copy_from_iter_pmem() and clear_pmem()")
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1cc3cdd
    • Vitaly Kuznetsov's avatar
      hv: don't reset hv_context.tsc_page on crash · 5693f3fb
      Vitaly Kuznetsov authored
      commit 56ef6718 upstream.
      
      It may happen that secondary CPUs are still alive and resetting
      hv_context.tsc_page will cause a consequent crash in read_hv_clock_tsc()
      as we don't check for it being not NULL there. It is safe as we're not
      freeing this page anyways.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: default avatarSumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5693f3fb
    • Vitaly Kuznetsov's avatar
      Drivers: hv: balloon: account for gaps in hot add regions · 03e2fb9b
      Vitaly Kuznetsov authored
      commit cb7a5724 upstream.
      
      I'm observing the following hot add requests from the WS2012 host:
      
      hot_add_req: start_pfn = 0x108200 count = 330752
      hot_add_req: start_pfn = 0x158e00 count = 193536
      hot_add_req: start_pfn = 0x188400 count = 239616
      
      As the host doesn't specify hot add regions we're trying to create
      128Mb-aligned region covering the first request, we create the 0x108000 -
      0x160000 region and we add 0x108000 - 0x158e00 memory. The second request
      passes the pfn_covered() check, we enlarge the region to 0x108000 -
      0x190000 and add 0x158e00 - 0x188200 memory. The problem emerges with the
      third request as it starts at 0x188400 so there is a 0x200 gap which is
      not covered. As the end of our region is 0x190000 now it again passes the
      pfn_covered() check were we just adjust the covered_end_pfn and make it
      0x188400 instead of 0x188200 which means that we'll try to online
      0x188200-0x188400 pages but these pages were never assigned to us and we
      crash.
      
      We can't react to such requests by creating new hot add regions as it may
      happen that the whole suggested range falls into the previously identified
      128Mb-aligned area so we'll end up adding nothing or create intersecting
      regions and our current logic doesn't allow that. Instead, create a list of
      such 'gaps' and check for them in the page online callback.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: default avatarSumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      03e2fb9b
    • Vitaly Kuznetsov's avatar
      Drivers: hv: balloon: keep track of where ha_region starts · 8e7a6dbc
      Vitaly Kuznetsov authored
      commit 7cf3b79e upstream.
      
      Windows 2012 (non-R2) does not specify hot add region in hot add requests
      and the logic in hot_add_req() is trying to find a 128Mb-aligned region
      covering the request. It may also happen that host's requests are not 128Mb
      aligned and the created ha_region will start before the first specified
      PFN. We can't online these non-present pages but we don't remember the real
      start of the region.
      
      This is a regression introduced by the commit 5abbbb75 ("Drivers: hv:
      hv_balloon: don't lose memory when onlining order is not natural"). While
      the idea of keeping the 'moving window' was wrong (as there is no guarantee
      that hot add requests come ordered) we should still keep track of
      covered_start_pfn. This is not a revert, the logic is different.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: default avatarSumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e7a6dbc