1. 14 Dec, 2014 3 commits
    • Artem Bityutskiy's avatar
      UBIFS: fix a race condition · 9cb35c87
      Artem Bityutskiy authored
      commit 052c2807 upstream.
      
      Hu (hujianyang@huawei.com) discovered a race condition which may lead to a
      situation when UBIFS is unable to mount the file-system after an unclean
      reboot. The problem is theoretical, though.
      
      In UBIFS, we have the log, which basically a set of LEBs in a certain area. The
      log has the tail and the head.
      
      Every time user writes data to the file-system, the UBIFS journal grows, and
      the log grows as well, because we append new reference nodes to the head of the
      log. So the head moves forward all the time, while the log tail stays at the
      same position.
      
      At any time, the UBIFS master node points to the tail of the log. When we mount
      the file-system, we scan the log, and we always start from its tail, because
      this is where the master node points to. The only occasion when the tail of the
      log changes is the commit operation.
      
      The commit operation has 2 phases - "commit start" and "commit end". The former
      is relatively short, and does not involve much I/O. During this phase we mostly
      just build various in-memory lists of the things which have to be written to
      the flash media during "commit end" phase.
      
      During the commit start phase, what we do is we "clean" the log. Indeed, the
      commit operation will index all the data in the journal, so the entire journal
      "disappears", and therefore the data in the log become unneeded. So we just
      move the head of the log to the next LEB, and write the CS node there. This LEB
      will be the tail of the new log when the commit operation finishes.
      
      When the "commit start" phase finishes, users may write more data to the
      file-system, in parallel with the ongoing "commit end" operation. At this point
      the log tail was not changed yet, it is the same as it had been before we
      started the commit. The log head keeps moving forward, though.
      
      The commit operation now needs to write the new master node, and the new master
      node should point to the new log tail. After this the LEBs between the old log
      tail and the new log tail can be unmapped and re-used again.
      
      And here is the possible problem. We do 2 operations: (a) We first update the
      log tail position in memory (see 'ubifs_log_end_commit()'). (b) And then we
      write the master node (see the big lock of code in 'do_commit()').
      
      But nothing prevents the log head from moving forward between (a) and (b), and
      the log head may "wrap" now to the old log tail. And when the "wrap" happens,
      the contends of the log tail gets erased. Now a power cut happens and we are in
      trouble. We end up with the old master node pointing to the old tail, which was
      erased. And replay fails because it expects the master node to point to the
      correct log tail at all times.
      
      This patch merges the abovementioned (a) and (b) operations by moving the master
      node change code to the 'ubifs_log_end_commit()' function, so that it runs with
      the log mutex locked, which will prevent the log from being changed benween
      operations (a) and (b).
      Reported-by: default avatarhujianyang <hujianyang@huawei.com>
      Tested-by: default avatarhujianyang <hujianyang@huawei.com>
      Signed-off-by: default avatarArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9cb35c87
    • Artem Bityutskiy's avatar
      UBIFS: remove mst_mutex · 7944bd2c
      Artem Bityutskiy authored
      commit 07e19dff upstream.
      
      The 'mst_mutex' is not needed since because 'ubifs_write_master()' is only
      called on the mount path and commit path. The mount path is sequential and
      there is no parallelism, and the commit path is also serialized - there is only
      one commit going on at a time.
      Signed-off-by: default avatarArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7944bd2c
    • David Matlack's avatar
      kvm: x86: fix stale mmio cache bug · d9c14518
      David Matlack authored
      commit 56f17dd3 upstream.
      
      The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
      up to userspace:
      
      (1) Guest accesses gpa X without a memory slot. The gfn is cached in
      struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
      the SPTE write-execute-noread so that future accesses cause
      EPT_MISCONFIGs.
      
      (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
      covering the page just accessed.
      
      (3) Guest attempts to read or write to gpa X again. On Intel, this
      generates an EPT_MISCONFIG. The memory slot generation number that
      was incremented in (2) would normally take care of this but we fast
      path mmio faults through quickly_check_mmio_pf(), which only checks
      the per-vcpu mmio cache. Since we hit the cache, KVM passes a
      KVM_EXIT_MMIO up to userspace.
      
      This patch fixes the issue by using the memslot generation number
      to validate the mmio cache.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Tested-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d9c14518
  2. 05 Nov, 2014 37 commits
    • Ben Hutchings's avatar
      Linux 3.2.64 · 7d039b97
      Ben Hutchings authored
      7d039b97
    • Guillaume Nault's avatar
      l2tp: fix race while getting PMTU on PPP pseudo-wire · 544bd1bf
      Guillaume Nault authored
      commit eed4d839 upstream.
      
      Use dst_entry held by sk_dst_get() to retrieve tunnel's PMTU.
      
      The dst_mtu(__sk_dst_get(tunnel->sock)) call was racy. __sk_dst_get()
      could return NULL if tunnel->sock->sk_dst_cache was reset just before the
      call, thus making dst_mtu() dereference a NULL pointer:
      
      [ 1937.661598] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
      [ 1937.664005] IP: [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005] PGD daf0c067 PUD d9f93067 PMD 0
      [ 1937.664005] Oops: 0000 [#1] SMP
      [ 1937.664005] Modules linked in: l2tp_ppp l2tp_netlink l2tp_core ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables udp_tunnel pppoe pppox ppp_generic slhc deflate ctr twofish_generic twofish_x86_64_3way xts lrw gf128mul glue_helper twofish_x86_64 twofish_common blowfish_generic blowfish_x86_64 blowfish_common des_generic cbc xcbc rmd160 sha512_generic hmac crypto_null af_key xfrm_algo 8021q garp bridge stp llc tun atmtcp clip atm ext3 mbcache jbd iTCO_wdt coretemp kvm_intel iTCO_vendor_support kvm pcspkr evdev ehci_pci lpc_ich mfd_core i5400_edac edac_core i5k_amb shpchp button processor thermal_sys xfs crc32c_generic libcrc32c dm_mod usbhid sg hid sr_mod sd_mod cdrom crc_t10dif crct10dif_common ata_generic ahci ata_piix tg3 libahci libata uhci_hcd ptp ehci_hcd pps_core usbcore scsi_mod libphy usb_common [last unloaded: l2tp_core]
      [ 1937.664005] CPU: 0 PID: 10022 Comm: l2tpstress Tainted: G           O   3.17.0-rc1 #1
      [ 1937.664005] Hardware name: HP ProLiant DL160 G5, BIOS O12 08/22/2008
      [ 1937.664005] task: ffff8800d8fda790 ti: ffff8800c43c4000 task.ti: ffff8800c43c4000
      [ 1937.664005] RIP: 0010:[<ffffffffa049db88>]  [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005] RSP: 0018:ffff8800c43c7de8  EFLAGS: 00010282
      [ 1937.664005] RAX: ffff8800da8a7240 RBX: ffff8800d8c64600 RCX: 000001c325a137b5
      [ 1937.664005] RDX: 8c6318c6318c6320 RSI: 000000000000010c RDI: 0000000000000000
      [ 1937.664005] RBP: ffff8800c43c7ea8 R08: 0000000000000000 R09: 0000000000000000
      [ 1937.664005] R10: ffffffffa048e2c0 R11: ffff8800d8c64600 R12: ffff8800ca7a5000
      [ 1937.664005] R13: ffff8800c439bf40 R14: 000000000000000c R15: 0000000000000009
      [ 1937.664005] FS:  00007fd7f610f700(0000) GS:ffff88011a600000(0000) knlGS:0000000000000000
      [ 1937.664005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1937.664005] CR2: 0000000000000020 CR3: 00000000d9d75000 CR4: 00000000000027e0
      [ 1937.664005] Stack:
      [ 1937.664005]  ffffffffa049da80 ffff8800d8fda790 000000000000005b ffff880000000009
      [ 1937.664005]  ffff8800daf3f200 0000000000000003 ffff8800c43c7e48 ffffffff81109b57
      [ 1937.664005]  ffffffff81109b0e ffffffff8114c566 0000000000000000 0000000000000000
      [ 1937.664005] Call Trace:
      [ 1937.664005]  [<ffffffffa049da80>] ? pppol2tp_connect+0x235/0x41e [l2tp_ppp]
      [ 1937.664005]  [<ffffffff81109b57>] ? might_fault+0x9e/0xa5
      [ 1937.664005]  [<ffffffff81109b0e>] ? might_fault+0x55/0xa5
      [ 1937.664005]  [<ffffffff8114c566>] ? rcu_read_unlock+0x1c/0x26
      [ 1937.664005]  [<ffffffff81309196>] SYSC_connect+0x87/0xb1
      [ 1937.664005]  [<ffffffff813e56f7>] ? sysret_check+0x1b/0x56
      [ 1937.664005]  [<ffffffff8107590d>] ? trace_hardirqs_on_caller+0x145/0x1a1
      [ 1937.664005]  [<ffffffff81213dee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 1937.664005]  [<ffffffff8114c262>] ? spin_lock+0x9/0xb
      [ 1937.664005]  [<ffffffff813092b4>] SyS_connect+0x9/0xb
      [ 1937.664005]  [<ffffffff813e56d2>] system_call_fastpath+0x16/0x1b
      [ 1937.664005] Code: 10 2a 84 81 e8 65 76 bd e0 65 ff 0c 25 10 bb 00 00 4d 85 ed 74 37 48 8b 85 60 ff ff ff 48 8b 80 88 01 00 00 48 8b b8 10 02 00 00 <48> 8b 47 20 ff 50 20 85 c0 74 0f 83 e8 28 89 83 10 01 00 00 89
      [ 1937.664005] RIP  [<ffffffffa049db88>] pppol2tp_connect+0x33d/0x41e [l2tp_ppp]
      [ 1937.664005]  RSP <ffff8800c43c7de8>
      [ 1937.664005] CR2: 0000000000000020
      [ 1939.559375] ---[ end trace 82d44500f28f8708 ]---
      
      Fixes: f34c4a35 ("l2tp: take PMTU from tunnel UDP socket")
      Signed-off-by: default avatarGuillaume Nault <g.nault@alphalink.fr>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      544bd1bf
    • Nadav Amit's avatar
      KVM: x86: Fix far-jump to non-canonical check · 77e4d28c
      Nadav Amit authored
      commit 7e46dddd upstream.
      
      Commit d1442d85 ("KVM: x86: Handle errors when RIP is set during far
      jumps") introduced a bug that caused the fix to be incomplete.  Due to
      incorrect evaluation, far jump to segment with L bit cleared (i.e., 32-bit
      segment) and RIP with any of the high bits set (i.e, RIP[63:32] != 0) set may
      not trigger #GP.  As we know, this imposes a security problem.
      
      In addition, the condition for two warnings was incorrect.
      
      Fixes: d1442d85Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      [Add #ifdef CONFIG_X86_64 to avoid complaints of undefined behavior. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      77e4d28c
    • Jens Axboe's avatar
      genhd: fix leftover might_sleep() in blk_free_devt() · 607530bf
      Jens Axboe authored
      commit 46f341ff upstream.
      
      Commit 2da78092 changed the locking from a mutex to a spinlock,
      so we now longer sleep in this context. But there was a leftover
      might_sleep() in there, which now triggers since we do the final
      free from an RCU callback. Get rid of it.
      Reported-by: default avatarPontus Fuchs <pontus.fuchs@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      607530bf
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Fix infinite spin in reading buffer · 77179bb8
      Steven Rostedt (Red Hat) authored
      commit 24607f11 upstream.
      
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues down when
      its not interrupted by a consuming reader. If a consuming read
      occurs, the iterator starts from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      Then if the iterator calls rb_iter_reset(), it will go into an
      infinite loop by checking if the cache doesn't match, doing the reset
      and retrying, just to see that the cache still doesn't match! Which
      should never happen as the reset is suppose to set the cache to the
      current value and there's locks that keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2 "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      77179bb8
    • Julian Anastasov's avatar
      ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack · 7756f3a8
      Julian Anastasov authored
      commit 2627b7e1 upstream.
      
      commit 8f4e0a18 ("IPVS netns exit causes crash in conntrack")
      added second ip_vs_conn_drop_conntrack call instead of just adding
      the needed check. As result, the first call still can cause
      crash on netns exit. Remove it.
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarHans Schillstrom <hans@schillstrom.com>
      Signed-off-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7756f3a8
    • Sergio Gelato's avatar
      nfsd: Fix ACL null pointer deref · 5b6da64a
      Sergio Gelato authored
      BugLink: http://bugs.launchpad.net/bugs/1348670
      
      Fix regression introduced in pre-3.14 kernels by cherry-picking
      aa07c713
      (NFSD: Call ->set_acl with a NULL ACL structure if no entries).
      
      The affected code was removed in 3.14 by commit
      4ac7249e
      (nfsd: use get_acl and ->set_acl).
      The ->set_acl methods are already able to cope with a NULL argument.
      Signed-off-by: default avatarSergio Gelato <Sergio.Gelato@astro.su.se>
      [bwh: Rewrite the subject]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      5b6da64a
    • Jan Kara's avatar
      ext2: Fix fs corruption in ext2_get_xip_mem() · 4b808fd2
      Jan Kara authored
      commit 7ba3ec57 upstream.
      
      Commit 8e3dffc6 "Ext2: mark inode dirty after the function
      dquot_free_block_nodirty is called" unveiled a bug in __ext2_get_block()
      called from ext2_get_xip_mem(). That function called ext2_get_block()
      mistakenly asking it to map 0 blocks while 1 was intended. Before the
      above mentioned commit things worked out fine by luck but after that commit
      we started returning that we allocated 0 blocks while we in fact
      allocated 1 block and thus allocation was looping until all blocks in
      the filesystem were exhausted.
      
      Fix the problem by properly asking for one block and also add assertion
      in ext2_get_blocks() to catch similar problems.
      Reported-and-tested-by: default avatarAndiry Xu <andiry.xu@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4b808fd2
    • Mikulas Patocka's avatar
      dm crypt: fix access beyond the end of allocated space · 15004af9
      Mikulas Patocka authored
      commit d49ec52f upstream.
      
      The DM crypt target accesses memory beyond allocated space resulting in
      a crash on 32 bit x86 systems.
      
      This bug is very old (it dates back to 2.6.25 commit 3a7f6c99 "dm
      crypt: use async crypto").  However, this bug was masked by the fact
      that kmalloc rounds the size up to the next power of two.  This bug
      wasn't exposed until 3.17-rc1 commit 298a9fa0 ("dm crypt: use per-bio
      data").  By switching to using per-bio data there was no longer any
      padding beyond the end of a dm-crypt allocated memory block.
      
      To minimize allocation overhead dm-crypt puts several structures into one
      block allocated with kmalloc.  The block holds struct ablkcipher_request,
      cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))),
      struct dm_crypt_request and an initialization vector.
      
      The variable dmreq_start is set to offset of struct dm_crypt_request
      within this memory block.  dm-crypt allocates the block with this size:
      cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size.
      
      When accessing the initialization vector, dm-crypt uses the function
      iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq
      + 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1).
      
      dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request
      structure.  However, when dm-crypt accesses the initialization vector, it
      takes a pointer to the end of dm_crypt_request, aligns it, and then uses
      it as the initialization vector.  If the end of dm_crypt_request is not
      aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the
      alignment causes the initialization vector to point beyond the allocated
      space.
      
      Fix this bug by calculating the variable iv_size_padding and adding it
      to the allocated size.
      
      Also correct the alignment of dm_crypt_request.  struct dm_crypt_request
      is specific to dm-crypt (it isn't used by the crypto subsystem at all),
      so it is aligned on __alignof__(struct dm_crypt_request).
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      15004af9
    • Andy Lutomirski's avatar
      x86,kvm,vmx: Preserve CR4 across VM entry · 9e793c5e
      Andy Lutomirski authored
      commit d974baa3 upstream.
      
      CR4 isn't constant; at least the TSD and PCE bits can vary.
      
      TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks
      like it's correct.
      
      This adds a branch and a read from cr4 to each vm entry.  Because it is
      extremely likely that consecutive entries into the same vcpu will have
      the same host cr4 value, this fixes up the vmcs instead of restoring cr4
      after the fact.  A subsequent patch will add a kernel-wide cr4 shadow,
      reducing the overhead in the common case to just two memory reads and a
      branch.
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Adjust context
       - Add struct vcpu_vmx *vmx parameter to vmx_set_constant_host_state(), done
         upstream in commit a547c6db ("KVM: VMX: Enable acknowledge interupt
         on vmexit")]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9e793c5e
    • Daniel Borkmann's avatar
      net: sctp: fix remote memory pressure from excessive queueing · 3a8c709b
      Daniel Borkmann authored
      commit 26b87c78 upstream.
      
      This scenario is not limited to ASCONF, just taken as one
      example triggering the issue. When receiving ASCONF probes
      in the form of ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ---- ASCONF_a; [ASCONF_b; ...; ASCONF_n;] JUNK ------>
        [...]
        ---- ASCONF_m; [ASCONF_o; ...; ASCONF_z;] JUNK ------>
      
      ... where ASCONF_a, ASCONF_b, ..., ASCONF_z are good-formed
      ASCONFs and have increasing serial numbers, we process such
      ASCONF chunk(s) marked with !end_of_packet and !singleton,
      since we have not yet reached the SCTP packet end. SCTP does
      only do verification on a chunk by chunk basis, as an SCTP
      packet is nothing more than just a container of a stream of
      chunks which it eats up one by one.
      
      We could run into the case that we receive a packet with a
      malformed tail, above marked as trailing JUNK. All previous
      chunks are here goodformed, so the stack will eat up all
      previous chunks up to this point. In case JUNK does not fit
      into a chunk header and there are no more other chunks in
      the input queue, or in case JUNK contains a garbage chunk
      header, but the encoded chunk length would exceed the skb
      tail, or we came here from an entirely different scenario
      and the chunk has pdiscard=1 mark (without having had a flush
      point), it will happen, that we will excessively queue up
      the association's output queue (a correct final chunk may
      then turn it into a response flood when flushing the
      queue ;)): I ran a simple script with incremental ASCONF
      serial numbers and could see the server side consuming
      excessive amount of RAM [before/after: up to 2GB and more].
      
      The issue at heart is that the chunk train basically ends
      with !end_of_packet and !singleton markers and since commit
      2e3216cd ("sctp: Follow security requirement of responding
      with 1 packet") therefore preventing an output queue flush
      point in sctp_do_sm() -> sctp_cmd_interpreter() on the input
      chunk (chunk = event_arg) even though local_cork is set,
      but its precedence has changed since then. In the normal
      case, the last chunk with end_of_packet=1 would trigger the
      queue flush to accommodate possible outgoing bundling.
      
      In the input queue, sctp_inq_pop() seems to do the right thing
      in terms of discarding invalid chunks. So, above JUNK will
      not enter the state machine and instead be released and exit
      the sctp_assoc_bh_rcv() chunk processing loop. It's simply
      the flush point being missing at loop exit. Adding a try-flush
      approach on the output queue might not work as the underlying
      infrastructure might be long gone at this point due to the
      side-effect interpreter run.
      
      One possibility, albeit a bit of a kludge, would be to defer
      invalid chunk freeing into the state machine in order to
      possibly trigger packet discards and thus indirectly a queue
      flush on error. It would surely be better to discard chunks
      as in the current, perhaps better controlled environment, but
      going back and forth, it's simply architecturally not possible.
      I tried various trailing JUNK attack cases and it seems to
      look good now.
      
      Joint work with Vlad Yasevich.
      
      Fixes: 2e3216cd ("sctp: Follow security requirement of responding with 1 packet")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      3a8c709b
    • Daniel Borkmann's avatar
      net: sctp: fix panic on duplicate ASCONF chunks · 9a3c6f2e
      Daniel Borkmann authored
      commit b69040d8 upstream.
      
      When receiving a e.g. semi-good formed connection scan in the
      form of ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ---------------- ASCONF_a; ASCONF_b ----------------->
      
      ... where ASCONF_a equals ASCONF_b chunk (at least both serials
      need to be equal), we panic an SCTP server!
      
      The problem is that good-formed ASCONF chunks that we reply with
      ASCONF_ACK chunks are cached per serial. Thus, when we receive a
      same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
      not need to process them again on the server side (that was the
      idea, also proposed in the RFC). Instead, we know it was cached
      and we just resend the cached chunk instead. So far, so good.
      
      Where things get nasty is in SCTP's side effect interpreter, that
      is, sctp_cmd_interpreter():
      
      While incoming ASCONF_a (chunk = event_arg) is being marked
      !end_of_packet and !singleton, and we have an association context,
      we do not flush the outqueue the first time after processing the
      ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
      queued up, although we set local_cork to 1. Commit 2e3216cd
      changed the precedence, so that as long as we get bundled, incoming
      chunks we try possible bundling on outgoing queue as well. Before
      this commit, we would just flush the output queue.
      
      Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
      continue to process the same ASCONF_b chunk from the packet. As
      we have cached the previous ASCONF_ACK, we find it, grab it and
      do another SCTP_CMD_REPLY command on it. So, effectively, we rip
      the chunk->list pointers and requeue the same ASCONF_ACK chunk
      another time. Since we process ASCONF_b, it's correctly marked
      with end_of_packet and we enforce an uncork, and thus flush, thus
      crashing the kernel.
      
      Fix it by testing if the ASCONF_ACK is currently pending and if
      that is the case, do not requeue it. When flushing the output
      queue we may relink the chunk for preparing an outgoing packet,
      but eventually unlink it when it's copied into the skb right
      before transmission.
      
      Joint work with Vlad Yasevich.
      
      Fixes: 2e3216cd ("sctp: Follow security requirement of responding with 1 packet")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9a3c6f2e
    • Daniel Borkmann's avatar
      net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks · aa001b04
      Daniel Borkmann authored
      commit 9de7922b upstream.
      
      Commit 6f4c618d ("SCTP : Add paramters validity check for
      ASCONF chunk") added basic verification of ASCONF chunks, however,
      it is still possible to remotely crash a server by sending a
      special crafted ASCONF chunk, even up to pre 2.6.12 kernels:
      
      skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
       head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
       end:0x440 dev:<NULL>
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:129!
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff8144fb1c>] skb_put+0x5c/0x70
       [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp]
       [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp]
       [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20
       [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp]
       [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp]
       [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0
       [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
       [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
       [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp]
       [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp]
       [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
       [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0
       [<ffffffff81497078>] ip_local_deliver+0x98/0xa0
       [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440
       [<ffffffff81496ac5>] ip_rcv+0x275/0x350
       [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750
       [<ffffffff81460588>] netif_receive_skb+0x58/0x60
      
      This can be triggered e.g., through a simple scripted nmap
      connection scan injecting the chunk after the handshake, for
      example, ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ------------------ ASCONF; UNKNOWN ------------------>
      
      ... where ASCONF chunk of length 280 contains 2 parameters ...
      
        1) Add IP address parameter (param length: 16)
        2) Add/del IP address parameter (param length: 255)
      
      ... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
      Address Parameter in the ASCONF chunk is even missing, too.
      This is just an example and similarly-crafted ASCONF chunks
      could be used just as well.
      
      The ASCONF chunk passes through sctp_verify_asconf() as all
      parameters passed sanity checks, and after walking, we ended
      up successfully at the chunk end boundary, and thus may invoke
      sctp_process_asconf(). Parameter walking is done with
      WORD_ROUND() to take padding into account.
      
      In sctp_process_asconf()'s TLV processing, we may fail in
      sctp_process_asconf_param() e.g., due to removal of the IP
      address that is also the source address of the packet containing
      the ASCONF chunk, and thus we need to add all TLVs after the
      failure to our ASCONF response to remote via helper function
      sctp_add_asconf_response(), which basically invokes a
      sctp_addto_chunk() adding the error parameters to the given
      skb.
      
      When walking to the next parameter this time, we proceed
      with ...
      
        length = ntohs(asconf_param->param_hdr.length);
        asconf_param = (void *)asconf_param + length;
      
      ... instead of the WORD_ROUND()'ed length, thus resulting here
      in an off-by-one that leads to reading the follow-up garbage
      parameter length of 12336, and thus throwing an skb_over_panic
      for the reply when trying to sctp_addto_chunk() next time,
      which implicitly calls the skb_put() with that length.
      
      Fix it by using sctp_walk_params() [ which is also used in
      INIT parameter processing ] macro in the verification *and*
      in ASCONF processing: it will make sure we don't spill over,
      that we walk parameters WORD_ROUND()'ed. Moreover, we're being
      more defensive and guard against unknown parameter types and
      missized addresses.
      
      Joint work with Vlad Yasevich.
      
      Fixes: b896b82b ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2:
       - Adjust context
       - sctp_sf_violation_paramlen() doesn't take a struct net * parameter]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      aa001b04
    • Nadav Amit's avatar
      KVM: x86: Handle errors when RIP is set during far jumps · f8a2b85d
      Nadav Amit authored
      commit d1442d85 upstream.
      
      Far jmp/call/ret may fault while loading a new RIP.  Currently KVM does not
      handle this case, and may result in failed vm-entry once the assignment is
      done.  The tricky part of doing so is that loading the new CS affects the
      VMCS/VMCB state, so if we fail during loading the new RIP, we are left in
      unconsistent state.  Therefore, this patch saves on 64-bit the old CS
      descriptor and restores it if loading RIP failed.
      
      This fixes CVE-2014-3647.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - Adjust context
       - __load_segment_descriptor() does not take an in_task_switch parameter]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f8a2b85d
    • Paolo Bonzini's avatar
      KVM: x86: use new CS.RPL as CPL during task switch · 11c0bdb6
      Paolo Bonzini authored
      commit 2356aaeb upstream.
      
      During task switch, all of CS.DPL, CS.RPL, SS.DPL must match (in addition
      to all the other requirements) and will be the new CPL.  So far this
      worked by carefully setting the CS selector and flag before doing the
      task switch; setting CS.selector will already change the CPL.
      
      However, this will not work once we get the CPL from SS.DPL, because
      then you will have to set the full segment descriptor cache to change
      the CPL.  ctxt->ops->cpl(ctxt) will then return the old CPL during the
      task switch, and the check that SS.DPL == CPL will fail.
      
      Temporarily assume that the CPL comes from CS.RPL during task switch
      to a protected-mode task.  This is the same approach used in QEMU's
      emulation code, which (until version 2.0) manually tracks the CPL.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - Adjust context
       - load_state_from_tss32() does not support VM86 mode]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      11c0bdb6
    • Nadav Amit's avatar
      KVM: x86: Emulator fixes for eip canonical checks on near branches · 71ca9dc3
      Nadav Amit authored
      commit 234f3ce4 upstream.
      
      Before changing rip (during jmp, call, ret, etc.) the target should be asserted
      to be canonical one, as real CPUs do.  During sysret, both target rsp and rip
      should be canonical. If any of these values is noncanonical, a #GP exception
      should occur.  The exception to this rule are syscall and sysenter instructions
      in which the assigned rip is checked during the assignment to the relevant
      MSRs.
      
      This patch fixes the emulator to behave as real CPUs do for near branches.
      Far branches are handled by the next patch.
      
      This fixes CVE-2014-3647.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - Adjust context
       - Use ctxt->regs[] instead of reg_read(), reg_write(), reg_rmw()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      71ca9dc3
    • Nadav Amit's avatar
      KVM: x86: Fix wrong masking on relative jump/call · ea8064a2
      Nadav Amit authored
      commit 05c83ec9 upstream.
      
      Relative jumps and calls do the masking according to the operand size, and not
      according to the address size as the KVM emulator does today.
      
      This patch fixes KVM behavior.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ea8064a2
    • Takuya Yoshikawa's avatar
      KVM: x86 emulator: Use opcode::execute for CALL · befadafe
      Takuya Yoshikawa authored
      commit d4ddafcd upstream.
      
      CALL: E8
      Signed-off-by: default avatarTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      befadafe
    • Petr Matousek's avatar
      kvm: vmx: handle invvpid vm exit gracefully · 3f09b1f1
      Petr Matousek authored
      commit a642fc30 upstream.
      
      On systems with invvpid instruction support (corresponding bit in
      IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid
      causes vm exit, which is currently not handled and results in
      propagation of unknown exit to userspace.
      
      Fix this by installing an invvpid vm exit handler.
      
      This is CVE-2014-3646.
      Signed-off-by: default avatarPetr Matousek <pmatouse@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - Adjust filename
       - Drop inapplicable change to exit reason string array]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      3f09b1f1
    • Nadav Har'El's avatar
      nEPT: Nested INVEPT · 02a988e6
      Nadav Har'El authored
      commit bfd0a56b upstream.
      
      If we let L1 use EPT, we should probably also support the INVEPT instruction.
      
      In our current nested EPT implementation, when L1 changes its EPT table
      for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
      the course of this modification already calls INVEPT. But if last level
      of shadow page is unsync not all L1's changes to EPT12 are intercepted,
      which means roots need to be synced when L1 calls INVEPT. Global INVEPT
      should not be different since roots are synced by kvm_mmu_load() each
      time EPTP02 changes.
      Reviewed-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: default avatarNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: default avatarJun Nakajima <jun.nakajima@intel.com>
      Signed-off-by: default avatarXinhao Xu <xinhao.xu@intel.com>
      Signed-off-by: default avatarYang Zhang <yang.z.zhang@Intel.com>
      Signed-off-by: default avatarGleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - Adjust context, filename
       - Simplify handle_invept() as recommended by Paolo - nEPT is not
         supported so we always raise #UD]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      02a988e6
    • Andy Honig's avatar
      KVM: x86: Improve thread safety in pit · 30a340f5
      Andy Honig authored
      commit 2febc839 upstream.
      
      There's a race condition in the PIT emulation code in KVM.  In
      __kvm_migrate_pit_timer the pit_timer object is accessed without
      synchronization.  If the race condition occurs at the wrong time this
      can crash the host kernel.
      
      This fixes CVE-2014-3611.
      Signed-off-by: default avatarAndrew Honig <ahonig@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      30a340f5
    • Nadav Amit's avatar
      KVM: x86: Check non-canonical addresses upon WRMSR · 76715b56
      Nadav Amit authored
      commit 854e8bb1 upstream.
      
      Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
      written to certain MSRs. The behavior is "almost" identical for AMD and Intel
      (ignoring MSRs that are not implemented in either architecture since they would
      anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
      non-canonical address is written on Intel but not on AMD (which ignores the top
      32-bits).
      
      Accordingly, this patch injects a #GP on the MSRs which behave identically on
      Intel and AMD.  To eliminate the differences between the architecutres, the
      value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
      canonical value before writing instead of injecting a #GP.
      
      Some references from Intel and AMD manuals:
      
      According to Intel SDM description of WRMSR instruction #GP is expected on
      WRMSR "If the source register contains a non-canonical address and ECX
      specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
      IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
      
      According to AMD manual instruction manual:
      LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
      LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
      form, a general-protection exception (#GP) occurs."
      IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
      base field must be in canonical form or a #GP fault will occur."
      IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
      be in canonical form."
      
      This patch fixes CVE-2014-3610.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2:
       - The various set_msr() functions all separate msr_index and data parameters]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      76715b56
    • Hannes Frederic Sowa's avatar
      ipv6: reuse ip6_frag_id from ip6_ufo_append_data · 8db33010
      Hannes Frederic Sowa authored
      commit 916e4cf4 upstream.
      
      Currently we generate a new fragmentation id on UFO segmentation. It
      is pretty hairy to identify the correct net namespace and dst there.
      Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
      available at all.
      
      This causes unreliable or very predictable ipv6 fragmentation id
      generation while segmentation.
      
      Luckily we already have pregenerated the ip6_frag_id in
      ip6_ufo_append_data and can use it here.
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust filename, indentation]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      8db33010
    • Theodore Ts'o's avatar
      ext4: fix BUG_ON in mb_free_blocks() · 4c844312
      Theodore Ts'o authored
      commit c99d1e6e upstream.
      
      If we suffer a block allocation failure (for example due to a memory
      allocation failure), it's possible that we will call
      ext4_discard_allocated_blocks() before we've actually allocated any
      blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
      be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
      triggering the BUG_ON on mb_free_blocks():
      
      	BUG_ON(last >= (sb->s_blocksize << 3));
      
      Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
      is zero.
      
      Also fix a missing ext4_mb_unload_buddy() call in
      ext4_discard_allocated_blocks().
      
      Google-Bug-Id: 16844242
      
      Fixes: 86f0afd4Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4c844312
    • chenweilong's avatar
      ipv6: reallocate addrconf router for ipv6 address when lo device up · a0a8667a
      chenweilong authored
      It fix the bug 67951 on bugzilla
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      The patch can't be applied directly, as it' used the function introduced
      by "commit 94e187c0" ip6_rt_put(), that patch can't be applied directly
      either.
      
      ====================
      
      From: Gao feng <gaofeng@cn.fujitsu.com>
      
      commit 33d99113 upstream.
      
      This commit don't have a stable tag, but it fix the bug
      no reply after loopback down-up.It's very worthy to be
      applied to stable 3.4 kernels.
      
      The bug is 67951 on bugzilla
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarGao feng <gaofeng@cn.fujitsu.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [weilong: s/ip6_rt_put/dst_release]
      Signed-off-by: default avatarChen Weilong <chenweilong@huawei.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      a0a8667a
    • Marcelo Ricardo Leitner's avatar
      ipv4: disable bh while doing route gc · 4715883b
      Marcelo Ricardo Leitner authored
      Further tests revealed that after moving the garbage collector to a work
      queue and protecting it with a spinlock may leave the system prone to
      soft lockups if bottom half gets very busy.
      
      It was reproced with a set of firewall rules that REJECTed packets. If
      the NIC bottom half handler ends up running on the same CPU that is
      running the garbage collector on a very large cache, the garbage
      collector will not be able to do its job due to the amount of work
      needed for handling the REJECTs and also won't reschedule.
      
      The fix is to disable bottom half during the garbage collecting, as it
      already was in the first place (most calls to it came from softirqs).
      Signed-off-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4715883b
    • Marcelo Ricardo Leitner's avatar
      ipv4: avoid parallel route cache gc executions · ad5ca98f
      Marcelo Ricardo Leitner authored
      When rt_intern_hash() has to deal with neighbour cache overflowing,
      it triggers the route cache garbage collector in an attempt to free
      some references on neighbour entries.
      
      Such call cannot be done async but should also not run in parallel with
      an already-running one, so that they don't collapse fighting over the
      hash lock entries.
      
      This patch thus blocks parallel executions with spinlocks:
      - A call from worker and from rt_intern_hash() are not the same, and
      cannot be merged, thus they will wait each other on rt_gc_lock.
      - Calls to gc from rt_intern_hash() may happen in parallel but we must
      wait for it to finish in order to try again. This dedup and
      synchrinozation is then performed by the locking just before calling
      __do_rt_garbage_collect().
      Signed-off-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ad5ca98f
    • Marcelo Ricardo Leitner's avatar
      ipv4: move route garbage collector to work queue · 6c383b3a
      Marcelo Ricardo Leitner authored
      Currently the route garbage collector gets called by dst_alloc() if it
      have more entries than the threshold. But it's an expensive call, that
      don't really need to be done by then.
      
      Another issue with current way is that it allows running the garbage
      collector with the same start parameters on multiple CPUs at once, which
      is not optimal. A system may even soft lockup if the cache is big enough
      as the garbage collectors will be fighting over the hash lock entries.
      
      This patch thus moves the garbage collector to run asynchronously on a
      work queue, much similar to how rt_expire_check runs.
      
      There is one condition left that allows multiple executions, which is
      handled by the next patch.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6c383b3a
    • Yoichi Yuasa's avatar
      MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches · d521f4ba
      Yoichi Yuasa authored
      commit 5596b0b2 upstream.
      
      [    1.904000] BUG: scheduling while atomic: swapper/1/0x00000002
      [    1.908000] Modules linked in:
      [    1.916000] CPU: 0 PID: 1 Comm: swapper Not tainted 3.12.0-rc2-lemote-los.git-5318619-dirty #1
      [    1.920000] Stack : 0000000031aac000 ffffffff810d0000 0000000000000052 ffffffff802730a4
                0000000000000000 0000000000000001 ffffffff810cdf90 ffffffff810d0000
                ffffffff8068b968 ffffffff806f5537 ffffffff810cdf90 980000009f0782e8
                0000000000000001 ffffffff80720000 ffffffff806b0000 980000009f078000
                980000009f290000 ffffffff805f312c 980000009f05b5d8 ffffffff80233518
                980000009f05b5e8 ffffffff80274b7c 980000009f078000 ffffffff8068b968
                0000000000000000 0000000000000000 0000000000000000 0000000000000000
                0000000000000000 980000009f05b520 0000000000000000 ffffffff805f2f6c
                0000000000000000 ffffffff80700000 ffffffff80700000 ffffffff806fc758
                ffffffff80700000 ffffffff8020be98 ffffffff806fceb0 ffffffff805f2f6c
                ...
      [    2.028000] Call Trace:
      [    2.032000] [<ffffffff8020be98>] show_stack+0x80/0x98
      [    2.036000] [<ffffffff805f2f6c>] __schedule_bug+0x44/0x6c
      [    2.040000] [<ffffffff805fac58>] __schedule+0x518/0x5b0
      [    2.044000] [<ffffffff805f8a58>] schedule_timeout+0x128/0x1f0
      [    2.048000] [<ffffffff80240314>] msleep+0x3c/0x60
      [    2.052000] [<ffffffff80495400>] do_probe+0x238/0x3a8
      [    2.056000] [<ffffffff804958b0>] ide_probe_port+0x340/0x7e8
      [    2.060000] [<ffffffff80496028>] ide_host_register+0x2d0/0x7a8
      [    2.064000] [<ffffffff8049c65c>] ide_pci_init_two+0x4e4/0x790
      [    2.068000] [<ffffffff8049f9b8>] amd74xx_probe+0x148/0x2c8
      [    2.072000] [<ffffffff803f571c>] pci_device_probe+0xc4/0x130
      [    2.076000] [<ffffffff80478f60>] driver_probe_device+0x98/0x270
      [    2.080000] [<ffffffff80479298>] __driver_attach+0xe0/0xe8
      [    2.084000] [<ffffffff80476ab0>] bus_for_each_dev+0x78/0xe0
      [    2.088000] [<ffffffff80478468>] bus_add_driver+0x230/0x310
      [    2.092000] [<ffffffff80479b44>] driver_register+0x84/0x158
      [    2.096000] [<ffffffff80200504>] do_one_initcall+0x104/0x160
      Signed-off-by: default avatarYoichi Yuasa <yuasa@linux-mips.org>
      Reported-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Tested-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Cc: linux-mips@linux-mips.org
      Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
      Patchwork: https://patchwork.linux-mips.org/patch/5941/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d521f4ba
    • Josh Triplett's avatar
      init/Kconfig: Hide printk log config if CONFIG_PRINTK=n · ff872daa
      Josh Triplett authored
      commit 361e9dfb upstream.
      
      The buffers sized by CONFIG_LOG_BUF_SHIFT and
      CONFIG_LOG_CPU_MAX_BUF_SHIFT do not exist if CONFIG_PRINTK=n, so don't
      ask about their size at all.
      Signed-off-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Acked-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      [bwh: Backported to 3.2: drop change to CONFIG_LOG_CPU_MAX_BUF_SHIFT]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ff872daa
    • Peter Zijlstra's avatar
      perf: fix perf bug in fork() · 96cb09b8
      Peter Zijlstra authored
      commit 6c72e350 upstream.
      
      Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
      calling perf_event_free_task() when failing sched_fork() we will not yet
      have done the memset() on ->perf_event_ctxp[] and will therefore try and
      'free' the inherited contexts, which are still in use by the parent
      process.  This is bad..
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarSylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      96cb09b8
    • Mel Gorman's avatar
      mm: migrate: Close race between migration completion and mprotect · b47191f7
      Mel Gorman authored
      commit d3cb8bf6 upstream.
      
      A migration entry is marked as write if pte_write was true at the time the
      entry was created. The VMA protections are not double checked when migration
      entries are being removed as mprotect marks write-migration-entries as
      read. It means that potentially we take a spurious fault to mark PTEs write
      again but it's straight-forward. However, there is a race between write
      migrations being marked read and migrations finishing. This potentially
      allows a PTE to be write that should have been read. Close this race by
      double checking the VMA permissions using maybe_mkwrite when migration
      completes.
      
      [torvalds@linux-foundation.org: use maybe_mkwrite]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b47191f7
    • Miklos Szeredi's avatar
      shmem: fix nlink for rename overwrite directory · 6bf3b2e3
      Miklos Szeredi authored
      commit b928095b upstream.
      
      If overwriting an empty directory with rename, then need to drop the extra
      nlink.
      
      Test prog:
      
      #include <stdio.h>
      #include <fcntl.h>
      #include <err.h>
      #include <sys/stat.h>
      
      int main(void)
      {
      	const char *test_dir1 = "test-dir1";
      	const char *test_dir2 = "test-dir2";
      	int res;
      	int fd;
      	struct stat statbuf;
      
      	res = mkdir(test_dir1, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir1);
      
      	res = mkdir(test_dir2, 0777);
      	if (res == -1)
      		err(1, "mkdir(\"%s\")", test_dir2);
      
      	fd = open(test_dir2, O_RDONLY);
      	if (fd == -1)
      		err(1, "open(\"%s\")", test_dir2);
      
      	res = rename(test_dir1, test_dir2);
      	if (res == -1)
      		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
      
      	res = fstat(fd, &statbuf);
      	if (res == -1)
      		err(1, "fstat(%i)", fd);
      
      	if (statbuf.st_nlink != 0) {
      		fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
      		return 1;
      	}
      
      	return 0;
      }
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6bf3b2e3
    • Joseph Qi's avatar
      ocfs2/dlm: do not get resource spinlock if lockres is new · 40d31c49
      Joseph Qi authored
      commit 5760a97c upstream.
      
      There is a deadlock case which reported by Guozhonghua:
        https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html
      
      This case is caused by &res->spinlock and &dlm->master_lock
      misordering in different threads.
      
      It was introduced by commit 8d400b81 ("ocfs2/dlm: Clean up refmap
      helpers").  Since lockres is new, it doesn't not require the
      &res->spinlock.  So remove it.
      
      Fixes: 8d400b81 ("ocfs2/dlm: Clean up refmap helpers")
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Reported-by: default avatarGuozhonghua <guozhonghua@h3c.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      40d31c49
    • Andreas Rohner's avatar
      nilfs2: fix data loss with mmap() · a428c0e1
      Andreas Rohner authored
      commit 56d7acc7 upstream.
      
      This bug leads to reproducible silent data loss, despite the use of
      msync(), sync() and a clean unmount of the file system.  It is easily
      reproducible with the following script:
      
        ----------------[BEGIN SCRIPT]--------------------
        mkfs.nilfs2 -f /dev/sdb
        mount /dev/sdb /mnt
      
        dd if=/dev/zero bs=1M count=30 of=/mnt/testfile
      
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"
      
        /root/mmaptest/mmaptest /mnt/testfile 30 10 5
      
        sync
        CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
        umount /mnt
      
        echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
        echo "AFTER MMAP:\t$CHECKSUM_AFTER"
        echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
        ----------------[END SCRIPT]--------------------
      
      The mmaptest tool looks something like this (very simplified, with
      error checking removed):
      
        ----------------[BEGIN mmaptest]--------------------
        data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, file_offset);
      
        for (i = 0; i < write_count; ++i) {
              memcpy(data + i * 4096, buf, sizeof(buf));
              msync(data, file_size - file_offset, MS_SYNC))
        }
        ----------------[END mmaptest]--------------------
      
      The output of the script looks something like this:
      
        BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
        AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
        AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
      
      So it is clear, that the changes done using mmap() do not survive a
      remount.  This can be reproduced a 100% of the time.  The problem was
      introduced in commit 136e8770 ("nilfs2: fix issue of
      nilfs_set_page_dirty() for page at EOF boundary").
      
      If the page was read with mpage_readpage() or mpage_readpages() for
      example, then it has no buffers attached to it.  In that case
      page_has_buffers(page) in nilfs_set_page_dirty() will be false.
      Therefore nilfs_set_file_dirty() is never called and the pages are never
      collected and never written to disk.
      
      This patch fixes the problem by also calling nilfs_set_file_dirty() if the
      page has no buffers attached to it.
      
      [akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
      Signed-off-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Tested-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      a428c0e1
    • Markos Chandras's avatar
      MIPS: mcount: Adjust stack pointer for static trace in MIPS32 · bbc3708a
      Markos Chandras authored
      commit 8a574cfa upstream.
      
      Every mcount() call in the MIPS 32-bit kernel is done as follows:
      
      [...]
      move at, ra
      jal _mcount
      addiu sp, sp, -8
      [...]
      
      but upon returning from the mcount() function, the stack pointer
      is not adjusted properly. This is explained in details in 58b69401
      (MIPS: Function tracer: Fix broken function tracing).
      
      Commit ad8c3969 ("MIPS: Unbreak function tracer for 64-bit kernel.)
      fixed the stack manipulation for 64-bit but it didn't fix it completely
      for MIPS32.
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/7792/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      bbc3708a
    • Robin Murphy's avatar
      ARM: 8165/1: alignment: don't break misaligned NEON load/store · edeb8f82
      Robin Murphy authored
      commit 5ca918e5 upstream.
      
      The alignment fixup incorrectly decodes faulting ARM VLDn/VSTn
      instructions (where the optional alignment hint is given but incorrect)
      as LDR/STR, leading to register corruption. Detect these and correctly
      treat them as unhandled, so that userspace gets the fault it expects.
      Reported-by: default avatarSimon Hosie <simon.hosie@arm.com>
      Signed-off-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      edeb8f82