1. 02 Sep, 2014 20 commits
    • sparc64: Fix top-level fault handling bugs. · 25be5155
      David S. Miller authored
      [ Upstream commit 70ffc6eb ]
      
      Make get_user_insn() able to cope with huge PMDs.
      
      Next, make do_fault_siginfo() more robust when get_user_insn() can't
      actually fetch the instruction.  In particular, use the MMU announced
      fault address when that happens, instead of calling
      compute_effective_address() and computing garbage.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Handle 32-bit tasks properly in compute_effective_address(). · 3c2dce01
      David S. Miller authored
      [ Upstream commit d037d163 ]
      
      If we have a 32-bit task we must chop off the top 32-bits of the
      64-bit value just as the cpu would.
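
      A minimal user-space sketch of that truncation (illustrative only,
      not the actual compute_effective_address() code): for a 32-bit task
      the computed address is masked down to its low 32 bits, mirroring
      what the CPU itself does.

      #include <stdint.h>
      #include <stdio.h>

      /* Illustrative helper, not the kernel function. */
      static uint64_t effective_address(uint64_t addr, int task_is_32bit)
      {
              /* A 32-bit task only ever sees the low 32 bits. */
              return task_is_32bit ? (addr & 0xffffffffULL) : addr;
      }

      int main(void)
      {
              uint64_t addr = 0x1234567880001000ULL;  /* hypothetical value */

              printf("%#llx\n", (unsigned long long)effective_address(addr, 1));
              return 0;
      }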
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Don't use _PAGE_PRESENT in pte_modify() mask. · 753516c9
      David S. Miller authored
      [ Upstream commit eaf85da8 ]
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Fix hex values in comment above pte_modify(). · f7a9655a
      David S. Miller authored
      [ Upstream commit c2e4e676 ]
      
      When _PAGE_SPECIAL and _PAGE_PMD_HUGE were added to the mask, the
      comment was not updated.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Fix bugs in get_user_pages_fast() wrt. THP. · 504162bc
      David S. Miller authored
      [ Upstream commit 04df419d ]
      
      The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to
      decide if it needs to bail and return 0.
      
      pmd_large() should therefore just check _PAGE_PMD_HUGE.
      
      Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we
      just need to add a valid bit check.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Fix huge PMD invalidation. · 1b2c548f
      David S. Miller authored
      [ Upstream commit 51e5ef1b ]
      
      On sparc64, "present" and "valid" are separate PTE bits; this allows us to
      naturally distinguish between the user explicitly asking for PROT_NONE
      with mprotect() and other situations.
      
      However we weren't handling this properly in the huge PMD paths.
      
      First of all, the page table walker in the TSB miss path only checks
      for _PAGE_PMD_HUGE.  So the generic pmdp_invalidate() would clear
      _PAGE_PRESENT but the TLB miss paths would still load it into the TLB
      as a valid huge PMD.
      
      Fix this by clearing the valid bit in pmdp_invalidate(), and also
      checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez"
      since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts.
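
      An illustrative sketch of why a sign test works here (assuming the
      usual two's-complement view): with _PAGE_VALID in bit 63, the value
      is negative as a signed 64-bit integer exactly when the valid bit is
      set, which is what a "brgez" (branch if >= 0) test relies on.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint64_t valid_bit = 1ULL << 63;          /* _PAGE_VALID position */
              uint64_t pmd = valid_bit | 0x123;         /* hypothetical huge PMD */
              uint64_t invalidated = pmd & ~valid_bit;  /* invalidation clears it */

              /* brgez-style test: >= 0 means the valid bit is clear. */
              printf("pmd valid:         %d\n", (int64_t)pmd < 0);
              printf("invalidated valid: %d\n", (int64_t)invalidated < 0);
              return 0;
      }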
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Fix executable bit testing in set_pmd_at() paths. · 4fcca4c9
      David S. Miller authored
      [ Upstream commit 5b1e94fa ]
      
      This code was mistakenly using the exec bit from the PMD in all
      cases, even when the PMD isn't a huge PMD.
      
      If it's not a huge PMD, test the exec bit in the individual ptes down
      in tlb_batch_pmd_scan().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Make itc_sync_lock raw · 1f070c25
      Kirill Tkhai authored
      [ Upstream commit 49b6c01f ]
      
      This is one more place where we must not be able
      to be preempted or interrupted in RT.
      
      Always actually disable interrupts during the
      synchronization cycle.
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sparc64: Fix argument sign extension for compat_sys_futex(). · eb7d14a2
      David S. Miller authored
      [ Upstream commit aa3449ee ]
      
      Only the second argument, 'op', is signed.
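
      A small stand-alone sketch of the sign-extension issue (not the
      syscall wrapper itself): a signed 32-bit argument like 'op' must be
      sign-extended when widened, while the unsigned arguments must be
      zero-extended.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint32_t raw = 0xdeadbeefu;      /* 32-bit value from compat userspace */
              int64_t  op  = (int32_t)raw;     /* signed argument: sign-extend */
              uint64_t val = raw;              /* unsigned argument: zero-extend */

              printf("sign-extended: %lld\n", (long long)op);
              printf("zero-extended: %llu\n", (unsigned long long)val);
              return 0;
      }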
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sctp: fix possible seqlock deadlock in sctp_packet_transmit() · bf3da416
      Eric Dumazet authored
      [ Upstream commit 757efd32 ]
      
      Dave reported the following splat, caused by improper use of
      IP_INC_STATS_BH() in process context.
      
      BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
      caller is __this_cpu_preempt_check+0x13/0x20
      CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
       ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
       0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
       ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
      Call Trace:
       [<ffffffff9e7ee207>] dump_stack+0x4e/0x7a
       [<ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
       [<ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
       [<ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
       [<ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
       [<ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
       [<ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
       [<ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
       [<ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
       [<ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
       [<ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
       [<ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
       [<ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
       [<ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
       [<ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
       [<ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
       [<ffffffff9e73b624>] inet_sendmsg+0x104/0x220
       [<ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
       [<ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
       [<ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
       [<ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
       [<ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
       [<ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
       [<ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
       [<ffffffff9e7f9be4>] tracesys+0xdd/0xe2
      
      This is a followup of commits f1d8cba6 ("inet: fix possible
      seqlock deadlocks") and 7f88c6b2 ("ipv6: fix possible seqlock
      deadlock in ip6_finish_output2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • batman-adv: Fix out-of-order fragmentation support · 11e9b8e6
      Sven Eckelmann authored
      [ Upstream commit d9124268 ]
      
      batadv_frag_insert_packet was unable to handle out-of-order packets because it
      dropped them directly. This is caused by the way the fragmentation list is
      checked for the correct place to insert a fragmentation entry.
      
      The fragmentation code keeps the fragments in lists. The fragmentation entries
      are kept in descending order of sequence number. The list is traversed and each
      entry is compared with the new fragment. If the current entry has a smaller
      sequence number than the new fragment then the new one has to be inserted
      before the current entry. This ensures that the list is still in descending
      order.
      
      An out-of-order packet with a smaller sequence number than all entries in the
      list still has to be added to the end of the list. The used hlist has no
      information about the last entry in the list inside hlist_head and thus the
      last entry has to be calculated differently. Currently the code assumes that
      the iterator variable of hlist_for_each_entry can be used for this purpose
      after the hlist_for_each_entry finished. This is obviously wrong because the
      iterator variable is always NULL when the list was completely traversed.
      
      Instead the information about the last entry has to be stored in a different
      variable.
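
      The pitfall can be shown with a plain singly-linked list (a
      simplified stand-in for the kernel's hlist): after a full traversal
      the loop cursor is NULL, so the last entry has to be remembered in
      its own variable.

      #include <stddef.h>
      #include <stdio.h>

      struct frag { int seqno; struct frag *next; };

      int main(void)
      {
              struct frag c = { 5, NULL }, b = { 7, &c }, a = { 9, &b };
              struct frag *iter, *last = NULL;

              for (iter = &a; iter != NULL; iter = iter->next)
                      last = iter;            /* track the real tail separately */

              /* iter is NULL here and must not be used as "the last entry". */
              printf("iter=%p last->seqno=%d\n", (void *)iter, last->seqno);
              return 0;
      }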
      
      This problem was introduced in 610bfc6b
      ("batman-adv: Receive fragmented packets and merge").
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • iovec: make sure the caller actually wants anything in memcpy_fromiovecend · 0afe6f30
      Sasha Levin authored
      [ Upstream commit 06ebb06d ]
      
      Check for cases when the caller requests 0 bytes instead of running off
      and dereferencing potentially invalid iovecs.
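
      A hedged sketch of the guard (illustrative names, much simpler than
      the real memcpy_fromiovecend()): a zero-byte request returns before
      the iovec array is ever dereferenced.

      #include <stdio.h>
      #include <string.h>
      #include <sys/uio.h>

      static int copy_from_iovec_example(unsigned char *kdata,
                                         const struct iovec *iov, size_t len)
      {
              if (len == 0)
                      return 0;       /* nothing wanted: never touch iov */

              /* Only now is it safe to look at the (possibly bogus) iovec. */
              memcpy(kdata, iov->iov_base, len);
              return 0;
      }

      int main(void)
      {
              unsigned char buf[8];

              /* A zero-length request with a NULL iovec must not crash. */
              printf("%d\n", copy_from_iovec_example(buf, NULL, 0));
              return 0;
      }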
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net: Correctly set segment mac_len in skb_segment(). · bf2c65d8
      Vlad Yasevich authored
      [ Upstream commit fcdfe3a7 ]
      
      When performing segmentation, the mac_len value is copied right
      out of the original skb.  However, this value is not always set correctly
      (like when the packet is VLAN-tagged) and we'll end up copying a bad
      value.
      
      One way to demonstrate this is to configure a VM which tags
      packets internally and turn off VLAN acceleration on the forwarding
      bridge port.  The packets show up corrupt like this:
      16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
      (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
              0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
              0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
              0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3
              0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
              0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              ...
      
      This also leads to awful throughput as GSO packets are dropped and
      cause retransmissions.
      
      The solution is to set the mac_len using the values already available
      in the new skb.  We've already adjusted all of the header offsets, so we
      might as well correctly figure out the mac_len using skb_reset_mac_len().
      After this change, packets are segmented correctly and performance
      is restored.
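
      A small arithmetic sketch of the idea, assuming skb_reset_mac_len()
      derives mac_len from the gap between the MAC and network header
      offsets: computed from the segment's own offsets, a VLAN tag is
      accounted for, instead of copying a possibly stale value.

      #include <stdio.h>

      int main(void)
      {
              unsigned int mac_header = 0;       /* start of the Ethernet header */
              unsigned int network_header = 18;  /* 14-byte Ethernet + 4-byte VLAN tag */

              unsigned int mac_len = network_header - mac_header;

              printf("mac_len=%u\n", mac_len);   /* 18, not the untagged 14 */
              return 0;
      }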
      
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • macvlan: Initialize vlan_features to turn on offload support. · 0a6b7256
      Vlad Yasevich authored
      [ Upstream commit 081e83a7 ]
      
      Macvlan devices do not initialize vlan_features.  As a result,
      any vlan devices configured on top of macvlans perform very poorly.
      Initialize vlan_features based on the vlan features of the lower-level
      device.
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tcp: Fix integer-overflow in TCP vegas · c3387955
      Christoph Paasch authored
      [ Upstream commit 1f74e613 ]
      
      In Vegas we do a multiplication of the cwnd and the rtt. This
      may overflow, and thus the result is stored in a u64. However, we first
      need to cast the cwnd so that 64-bit arithmetic is actually done.
      
      Then, we need to do do_div to allow this to be used on 32-bit arches.
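
      The overflow itself is easy to reproduce in isolation: multiplying
      two 32-bit values yields a 32-bit (wrapping) intermediate result
      unless one operand is widened first, even when the result is stored
      in a u64. The values below are hypothetical.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint32_t cwnd = 0x10000, rtt = 0x10000;  /* hypothetical values */

              uint64_t wrong = cwnd * rtt;             /* 32-bit multiply: wraps to 0 */
              uint64_t right = (uint64_t)cwnd * rtt;   /* widened first: 0x100000000 */

              printf("wrong=%llu right=%llu\n",
                     (unsigned long long)wrong, (unsigned long long)right);
              return 0;
      }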
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Doug Leith <doug.leith@nuim.ie>
      Fixes: 8d3a564d (tcp: tcp_vegas cong avoid fix)
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tcp: Fix integer-overflows in TCP veno · 04e4957d
      Christoph Paasch authored
      [ Upstream commit 45a07695 ]
      
      In Veno we do a multiplication of the cwnd and the rtt. This
      may overflow, and thus the result is stored in a u64. However, we first
      need to cast the cwnd so that 64-bit arithmetic is actually done.
      
      A first attempt at fixing 76f10177 ([TCP]: TCP Veno congestion
      control) was made by 15913114 (tcp: Overflow bug in Vegas), but it
      failed to add the required cast in tcp_veno_cong_avoid().
      
      Fixes: 76f10177 ([TCP]: TCP Veno congestion control)
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ip: make IP identifiers less predictable · 357ca089
      Eric Dumazet authored
      [ Upstream commit 04ca6973 ]
      
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways of exploiting Linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if the generator is used at most
      once per jiffy, this patch inserts no hole in the ID sequence and does not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, we add saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see the IDs are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions use a per-flow ID generator that is not changed by this patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • inetpeer: get rid of ip_id_count · a7d543ba
      Eric Dumazet authored
      [ Upstream commit 73f156a6 ]
      
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      Linux kernels used the inet_peer cache for this purpose, but this had a huge
      cost on servers that disable MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If the server deals with many TCP flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, and inserting it in
         the tree with the same initial ip_id_count (cf. secure_ip_id()).
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation does not have to be 'perfect'.
      
      The goal is to avoid duplicates within a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkins hash using the dst IP
      as the key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
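
      A rough user-space sketch of the scheme described above (a
      hash-indexed array of per-bucket generators); the hash, bucket
      count, and names here are simplified placeholders, not the kernel
      implementation.

      #include <stdint.h>
      #include <stdio.h>

      #define NR_GENERATORS 1024                 /* illustrative bucket count */

      static uint32_t id_buckets[NR_GENERATORS];

      /* Toy hash standing in for the Jenkins hash of the destination IP. */
      static uint32_t bucket_of(uint32_t daddr)
      {
              return (daddr * 2654435761u) % NR_GENERATORS;
      }

      static uint32_t ip_ident_reserve(uint32_t daddr, uint32_t segs)
      {
              uint32_t *bucket = &id_buckets[bucket_of(daddr)];
              uint32_t id = *bucket;

              *bucket += segs;                   /* reserve 'segs' consecutive IDs */
              return id;
      }

      int main(void)
      {
              uint32_t daddr = 0x0a000001;       /* hypothetical 10.0.0.1 */

              printf("%u\n", ip_ident_reserve(daddr, 1));
              printf("%u\n", ip_ident_reserve(daddr, 3));
              printf("%u\n", ip_ident_reserve(daddr, 1));
              return 0;
      }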
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • bnx2x: fix crash during TSO tunneling · 478cc9ad
      Dmitry Kravkov authored
      [ Upstream commit fe26566d ]
      
      When a TSO packet is transmitted, an additional BD without a mapping is
      used to describe the packet. This BD needs special handling in TX
      completion.
      
      kernel: Call Trace:
      kernel: <IRQ>  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
      kernel: [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
      kernel: [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
      kernel: [<ffffffff814a8c0d>] ? find_iova+0x4d/0x90
      kernel: [<ffffffff814ab0e2>] intel_unmap_page.part.36+0x142/0x160
      kernel: [<ffffffff814ad0e6>] intel_unmap_page+0x26/0x30
      kernel: [<ffffffffa01f55d7>] bnx2x_free_tx_pkt+0x157/0x2b0 [bnx2x]
      kernel: [<ffffffffa01f8dac>] bnx2x_tx_int+0xac/0x220 [bnx2x]
      kernel: [<ffffffff8101a0d9>] ? read_tsc+0x9/0x20
      kernel: [<ffffffffa01f8fdb>] bnx2x_poll+0xbb/0x3c0 [bnx2x]
      kernel: [<ffffffff814d041a>] net_rx_action+0x15a/0x250
      kernel: [<ffffffff81067047>] __do_softirq+0xf7/0x290
      kernel: [<ffffffff815f3a5c>] call_softirq+0x1c/0x30
      kernel: [<ffffffff81014d25>] do_softirq+0x55/0x90
      kernel: [<ffffffff810673e5>] irq_exit+0x115/0x120
      kernel: [<ffffffff815f4358>] do_IRQ+0x58/0xf0
      kernel: [<ffffffff815e94ad>] common_interrupt+0x6d/0x6d
      kernel: <EOI>  [<ffffffff810bbff7>] ? clockevents_notify+0x127/0x140
      kernel: [<ffffffff814834df>] ? cpuidle_enter_state+0x4f/0xc0
      kernel: [<ffffffff81483615>] cpuidle_idle_call+0xc5/0x200
      kernel: [<ffffffff8101bc7e>] arch_cpu_idle+0xe/0x30
      kernel: [<ffffffff810b4725>] cpu_startup_entry+0xf5/0x290
      kernel: [<ffffffff815cfee1>] start_secondary+0x265/0x27b
      kernel: ---[ end trace 11aa7726f18d7e80 ]---
      
      Fixes: a848ade4 ("bnx2x: add CSUM and TSO support for encapsulation protocols")
      Reported-by: Yulong Pei <ypei@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601) · e5e33c3f
      Michael S. Tsirkin authored
      commit 350b8bdd upstream.
      
      The third parameter of kvm_iommu_put_pages is wrong;
      it should be 'gfn - slot->base_gfn'.
      
      By making gfn very large, a malicious guest or userspace can cause kvm to
      go down this error path and subsequently pass a huge value as the size.
      Alternatively, if gfn is small, pages would be pinned but never
      unpinned, causing a host memory leak and local DoS.
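
      The arithmetic behind the fix can be sketched in isolation: the
      unmap helper expects a page count relative to the slot, so the
      caller must pass 'gfn - slot->base_gfn' (the number of frames
      already mapped), not the absolute gfn. Values are hypothetical.

      #include <stdio.h>

      int main(void)
      {
              unsigned long long base_gfn = 0x100000;  /* hypothetical slot base */
              unsigned long long gfn = 0x100040;       /* frame where mapping failed */

              unsigned long long wrong_pages = gfn;            /* huge bogus count */
              unsigned long long right_pages = gfn - base_gfn; /* 0x40 pages */

              printf("wrong=%llu right=%llu\n", wrong_pages, right_pages);
              return 0;
      }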
      
      Passing a reasonable but large value could be the most dangerous case,
      because it would unpin a page that should have stayed pinned, and thus
      allow the device to DMA into arbitrary memory.  However, this cannot
      happen because of the condition that can trigger the error:
      
      - out of memory (where you can't allocate even a single page)
        should not be possible for the attacker to trigger
      
      - when exceeding the iommu's address space, guest pages after gfn
        will also exceed the iommu's address space, and inside
        kvm_iommu_put_pages() the iommu_iova_to_phys() will fail.  The
        page thus would not be unpinned at all.
      Reported-by: Jack Morgenstein <jackm@mellanox.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  2. 15 Aug, 2014 1 commit
    • net: sctp: inherit auth_capable on INIT collisions · f730178b
      Daniel Borkmann authored
      commit 1be9a950 upstream.
      
      Jason reported an oops caused by SCTP on his ARM machine with
      SCTP authentication enabled:
      
      Internal error: Oops: 17 [#1] ARM
      CPU: 0 PID: 104 Comm: sctp-test Not tainted 3.13.0-68744-g3632f30c9b20-dirty #1
      task: c6eefa40 ti: c6f52000 task.ti: c6f52000
      PC is at sctp_auth_calculate_hmac+0xc4/0x10c
      LR is at sg_init_table+0x20/0x38
      pc : [<c024bb80>]    lr : [<c00f32dc>]    psr: 40000013
      sp : c6f538e8  ip : 00000000  fp : c6f53924
      r10: c6f50d80  r9 : 00000000  r8 : 00010000
      r7 : 00000000  r6 : c7be4000  r5 : 00000000  r4 : c6f56254
      r3 : c00c8170  r2 : 00000001  r1 : 00000008  r0 : c6f1e660
      Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
      Control: 0005397f  Table: 06f28000  DAC: 00000015
      Process sctp-test (pid: 104, stack limit = 0xc6f521c0)
      Stack: (0xc6f538e8 to 0xc6f54000)
      [...]
      Backtrace:
      [<c024babc>] (sctp_auth_calculate_hmac+0x0/0x10c) from [<c0249af8>] (sctp_packet_transmit+0x33c/0x5c8)
      [<c02497bc>] (sctp_packet_transmit+0x0/0x5c8) from [<c023e96c>] (sctp_outq_flush+0x7fc/0x844)
      [<c023e170>] (sctp_outq_flush+0x0/0x844) from [<c023ef78>] (sctp_outq_uncork+0x24/0x28)
      [<c023ef54>] (sctp_outq_uncork+0x0/0x28) from [<c0234364>] (sctp_side_effects+0x1134/0x1220)
      [<c0233230>] (sctp_side_effects+0x0/0x1220) from [<c02330b0>] (sctp_do_sm+0xac/0xd4)
      [<c0233004>] (sctp_do_sm+0x0/0xd4) from [<c023675c>] (sctp_assoc_bh_rcv+0x118/0x160)
      [<c0236644>] (sctp_assoc_bh_rcv+0x0/0x160) from [<c023d5bc>] (sctp_inq_push+0x6c/0x74)
      [<c023d550>] (sctp_inq_push+0x0/0x74) from [<c024a6b0>] (sctp_rcv+0x7d8/0x888)
      
      While we already had various kinds of bugs in that area
      ec0223ec ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if
      we/peer is AUTH capable") and b14878cc ("net: sctp: cache
      auth_enable per endpoint"), this one is a bit of a different
      kind.
      
      Giving a bit more background on why SCTP authentication is
      needed can be found in RFC4895:
      
        SCTP uses 32-bit verification tags to protect itself against
        blind attackers. These values are not changed during the
        lifetime of an SCTP association.
      
        Looking at new SCTP extensions, there is the need to have a
        method of proving that an SCTP chunk(s) was really sent by
        the original peer that started the association and not by a
        malicious attacker.
      
      To cause this bug, we're triggering an INIT collision between
      peers; a normal SCTP handshake where both sides intend to
      authenticate packets contains RANDOM; CHUNKS; HMAC-ALGO
      parameters that are being negotiated among peers:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895 says that each endpoint therefore knows its own random
      number and the peer's random number *after* the association
      has been established. The local and peer's random number along
      with the shared key are then part of the secret used for
      calculating the HMAC in the AUTH chunk.
      
      Now, in our scenario, we have 2 threads with 1 non-blocking
      SEQ_PACKET socket each, setting up common shared SCTP_AUTH_KEY
      and SCTP_AUTH_ACTIVE_KEY properly, and each of them calling
      sctp_bindx(3), listen(2) and connect(2) against each other,
      thus the handshake looks similar to this, e.g.:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        <--------- INIT[RANDOM; CHUNKS; HMAC-ALGO] -----------
        -------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] -------->
        ...
      
      Since such collisions can also happen with verification tags,
      the RFC4895 for AUTH rather vaguely says under section 6.1:
      
        In case of INIT collision, the rules governing the handling
        of this Random Number follow the same pattern as those for
        the Verification Tag, as explained in Section 5.2.4 of
        RFC 2960 [5]. Therefore, each endpoint knows its own Random
        Number and the peer's Random Number after the association
        has been established.
      
      In RFC2960, section 5.2.4, we're eventually hitting Action B:
      
        B) In this case, both sides may be attempting to start an
           association at about the same time but the peer endpoint
           started its INIT after responding to the local endpoint's
           INIT. Thus it may have picked a new Verification Tag not
           being aware of the previous Tag it had sent this endpoint.
           The endpoint should stay in or enter the ESTABLISHED
           state but it MUST update its peer's Verification Tag from
            the State Cookie, stop any init or cookie timers that may be
            running, and send a COOKIE ACK.
      
      In other words, the handling of the Random parameter is the
      same as behavior for the Verification Tag as described in
      Action B of section 5.2.4.
      
      Looking at the code, we exactly hit the sctp_sf_do_dupcook_b()
      case which triggers an SCTP_CMD_UPDATE_ASSOC command to the
      side effect interpreter, and in fact it properly copies over
      peer_{random, hmacs, chunks} parameters from the newly created
      association to update the existing one.
      
      Also, the old asoc_shared_key is being released and, based on
      the new params, updated via sctp_auth_asoc_init_active_key().
      However, the issue observed in this case is that the previous
      asoc->peer.auth_capable was 0, and has *not* been updated, so
      that instead of creating a new secret, we're doing an early
      return from the function sctp_auth_asoc_init_active_key()
      leaving asoc->asoc_shared_key as NULL. However, we now have to
      authenticate chunks from the updated chunk list (e.g. COOKIE-ACK).
      
      That in fact causes the server side when responding with ...
      
        <------------------ AUTH; COOKIE-ACK -----------------
      
      ... to trigger a NULL pointer dereference, since in
      sctp_packet_transmit(), it discovers that an AUTH chunk is
      being queued for xmit, and thus it calls sctp_auth_calculate_hmac().
      
      Since the asoc->active_key_id is still inherited from the
      endpoint, and the same as encoded into the chunk, it uses
      asoc->asoc_shared_key, which is still NULL, as an asoc_key
      and dereferences it in ...
      
        crypto_hash_setkey(desc.tfm, &asoc_key->data[0], asoc_key->len)
      
      ... causing an oops. All this happens because sctp_make_cookie_ack()
      called with the *new* association has the peer.auth_capable=1
      and therefore marks the chunk with auth=1 after checking
      sctp_auth_send_cid(), but it is *actually* sent later on over
      the then *updated* association's transport that didn't initialize
      its shared key due to peer.auth_capable=0. Since control chunks
      in that case are not sent by the temporary association which
      are scheduled for deletion, they are issued for xmit via
      SCTP_CMD_REPLY in the interpreter with the context of the
      *updated* association. peer.auth_capable was 0 in the updated
      association (which went from COOKIE_WAIT into ESTABLISHED state),
      since all previous processing that performed sctp_process_init()
      was being done on temporary associations, that we eventually
      throw away each time.
      
      The correct fix is to update to the new peer.auth_capable
      value as well in the collision case via sctp_assoc_update(),
      so that in case the collision migrated from 0 -> 1,
      sctp_auth_asoc_init_active_key() can properly recalculate
      the secret. This therefore fixes the observed server panic.
      
      Fixes: 730fc3d0 ("[SCTP]: Implete SCTP-AUTH parameter processing")
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      CVE-2014-5077
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  3. 13 Aug, 2014 5 commits
    • Linux 3.13.11.6 · 39302648
      Kamal Mostafa authored
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • mnt: Change the default remount atime from relatime to the existing value · 78c573c8
      Eric W. Biederman authored
      commit ffbc6f0e upstream.
      
      Since March 2009 the kernel has behaved such that if no
      MS_...ATIME flags are passed, it defaults to relatime.
      
      Defaulting to relatime instead of the existing atime state during a
      remount is silly, and causes problems in practice for people who don't
      specify any MS_...ATIME flags and expect to get the existing filesystem
      atime setting.  Those users may encounter a permission error because the
      default atime setting does not work.
      
      A default that does not work and causes permission problems is
      ridiculous, so preserve the existing value to have a default
      atime setting that is always guaranteed to work.
      
      Using the default atime setting in this way is particularly
      interesting for applications built to run in restricted userspace
      environments without /proc mounted, as the existing atime mount
      options of a filesystem can not be read from /proc/mounts.
      
      In practice this fixes user space that uses the default atime
      setting on remount and that is broken by the permission checks
      keeping less privileged users from changing more privileged users'
      atime settings.
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • mnt: Correct permission checks in do_remount · 27bec336
      Eric W. Biederman authored
      commit 9566d674 upstream.
      
      While investigating the issue wherein "mount --bind -oremount,ro ..."
      would result in a later "mount --bind -oremount,rw" succeeding even if
      the mount started off locked, I realized that there are several
      additional mount flags that should be locked but are not.
      
      In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
      flags in addition to MNT_READONLY should all be locked.  These
      flags are all per superblock, can all be changed with MS_BIND,
      and should not be changable if set by a more privileged user.
      
      The following additions to the current logic are added in this patch.
      - nosuid may not be clearable by a less privileged user.
      - nodev  may not be clearable by a less privileged user.
      - noexec may not be clearable by a less privileged user.
      - atime flags may not be changeable by a less privileged user.
      
      The logic with atime is that always setting atime on access is a
      global policy and backup software and auditing software could break if
      atime bits are not updated (when they are configured to be updated),
      and serious performance degradation could result (DOS attack) if atime
      updates happen when they have been explicitly disabled.  Therefore an
      unprivileged user should not be able to mess with the atime bits set
      by a more privileged user.
      
      The additional restrictions are implemented with the addition of
      MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
      mnt flags.
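
      A minimal sketch of the locking rule, with made-up flag values (the
      real MNT_* constants and checks live in the VFS and differ in
      detail): a flag that is set and locked may not be cleared by a
      remount request.

      #include <stdio.h>

      #define MNT_NODEV        0x1    /* illustrative values, not the kernel's */
      #define MNT_LOCK_NODEV   0x2

      static int remount_allowed(unsigned int cur, unsigned int requested)
      {
              /* Refuse to clear nodev while it is locked on this mount. */
              if ((cur & MNT_LOCK_NODEV) &&
                  (cur & MNT_NODEV) && !(requested & MNT_NODEV))
                      return 0;
              return 1;
      }

      int main(void)
      {
              unsigned int cur = MNT_NODEV | MNT_LOCK_NODEV;

              printf("clear nodev: %s\n", remount_allowed(cur, 0) ? "ok" : "denied");
              printf("keep nodev:  %s\n",
                     remount_allowed(cur, MNT_NODEV) ? "ok" : "denied");
              return 0;
      }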
      
      Taken together these changes and the fixes for MNT_LOCK_READONLY
      should make it safe for an unprivileged user to create a user
      namespace and to call "mount --bind -o remount,... ..." without
      the danger of mount flags being changed maliciously.
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount · 990c710d
      Eric W. Biederman authored
      commit 07b64558 upstream.
      
      There are no races as locked mount flags are guaranteed to never change.
      
      Moving the test into do_remount makes it more visible, and ensures all
      filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
      second case is not an issue today as filesystem remounts are guarded
      by capable(CAP_SYS_ADMIN) and thus will always fail in less privileged
      mount namespaces, but it could become an issue in the future.
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • mnt: Only change user settable mount flags in remount · 2890802f
      Eric W. Biederman authored
      commit a6138db8 upstream.
      
      Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
      read-only bind mount read-only in a user namespace the
      MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
      to then remount a read-only mount read-write.
      
      Correct this by replacing the mask of mount flags to preserve
      with a mask of mount flags that may be changed, and preserve
      all others.   This ensures that any future bugs with this mask and
      remount will fail in an easy to detect way where new mount flags
      simply won't change.
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  4. 12 Aug, 2014 1 commit
  5. 08 Aug, 2014 13 commits
    • ipv4: fix buffer overflow in ip_options_compile() · 7a662154
      Eric Dumazet authored
      [ Upstream commit 10ec9472 ]
      
      There is a benign buffer overflow in ip_options_compile spotted by
      AddressSanitizer [1]:
      
      It's benign because we can always access one extra byte in skb->head
      (because the header is followed by struct skb_shared_info), and in this
      case this byte is not even used.
      
      [28504.910798] ==================================================================
      [28504.912046] AddressSanitizer: heap-buffer-overflow in ip_options_compile
      [28504.913170] Read of size 1 by thread T15843:
      [28504.914026]  [<ffffffff81802f91>] ip_options_compile+0x121/0x9c0
      [28504.915394]  [<ffffffff81804a0d>] ip_options_get_from_user+0xad/0x120
      [28504.916843]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.918175]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.919490]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.920835]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.922208]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.923459]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.924722]
      [28504.925106] Allocated by thread T15843:
      [28504.925815]  [<ffffffff81804995>] ip_options_get_from_user+0x35/0x120
      [28504.926884]  [<ffffffff8180dedf>] do_ip_setsockopt.isra.15+0x8df/0x1630
      [28504.927975]  [<ffffffff8180ec60>] ip_setsockopt+0x30/0xa0
      [28504.929175]  [<ffffffff8181e59b>] tcp_setsockopt+0x5b/0x90
      [28504.930400]  [<ffffffff8177462f>] sock_common_setsockopt+0x5f/0x70
      [28504.931677]  [<ffffffff817729c2>] SyS_setsockopt+0xa2/0x140
      [28504.932851]  [<ffffffff818cfb69>] system_call_fastpath+0x16/0x1b
      [28504.934018]
      [28504.934377] The buggy address ffff880026382828 is located 0 bytes to the right
      [28504.934377]  of 40-byte region [ffff880026382800, ffff880026382828)
      [28504.937144]
      [28504.937474] Memory state around the buggy address:
      [28504.938430]  ffff880026382300: ........ rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.939884]  ffff880026382400: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.941294]  ffff880026382500: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.942504]  ffff880026382600: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.943483]  ffff880026382700: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.944511] >ffff880026382800: .....rrr rrrrrrrr rrrrrrrr rrrrrrrr
      [28504.945573]                         ^
      [28504.946277]  ffff880026382900: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.094949]  ffff880026382a00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.096114]  ffff880026382b00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.097116]  ffff880026382c00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.098472]  ffff880026382d00: ffffffff rrrrrrrr rrrrrrrr rrrrrrrr
      [28505.099804] Legend:
      [28505.100269]  f - 8 freed bytes
      [28505.100884]  r - 8 redzone bytes
      [28505.101649]  . - 8 allocated bytes
      [28505.102406]  x=1..7 - x allocated bytes + (8-x) redzone bytes
      [28505.103637] ==================================================================
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • dns_resolver: Null-terminate the right string · 92eaebf5
      Ben Hutchings authored
      [ Upstream commit 640d7efe ]
      
      *_result[len] is parsed as *(_result[len]) which is not at all what we
      want to touch here.
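
      The precedence trap is easy to demonstrate in isolation: with a
      char **_result, _result[len] indexes the pointer, not the string,
      so the terminator has to be written through (*_result)[len].

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(void)
      {
              size_t len = 5;
              char *buf = malloc(len + 1);
              char **_result = &buf;

              if (!buf)
                      return 1;

              memcpy(buf, "hello", len);
              (*_result)[len] = '\0';  /* correct: terminates the copied data */
              /* _result[len] = ...    would index a non-existent pointer slot */

              printf("%s\n", *_result);
              free(buf);
              return 0;
      }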
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: 84a7c0b1 ("dns_resolver: assure that dns_query() result is null-terminated")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • dns_resolver: assure that dns_query() result is null-terminated · 34a68a57
      Manuel Schölling authored
      [ Upstream commit 84a7c0b1 ]
      
      dns_query() credulously assumes that keys are null-terminated and
      returns a copy of a memory block that is off by one.
      Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net: huawei_cdc_ncm: add "subclass 3" devices · 37ed63bf
      Bjørn Mork authored
      [ Upstream commit c2a6c781 ]
      
      Huawei's usage of the subclass and protocol fields is not 100%
      clear to us, but there appears to be a very strict system.
      
      A device with the "shared" device ID 12d1:1506 and this NCM
      function was recently reported (showing only default altsetting):
      
          Interface Descriptor:
            bLength                 9
            bDescriptorType         4
            bInterfaceNumber        1
            bAlternateSetting       0
            bNumEndpoints           1
            bInterfaceClass       255 Vendor Specific Class
            bInterfaceSubClass      3
            bInterfaceProtocol     22
            iInterface              8 CDC Network Control Model (NCM)
            ** UNRECOGNIZED:  05 24 00 10 01
            ** UNRECOGNIZED:  06 24 1a 00 01 1f
            ** UNRECOGNIZED:  0c 24 1b 00 01 00 04 10 14 dc 05 20
            ** UNRECOGNIZED:  0d 24 0f 0a 0f 00 00 00 ea 05 03 00 01
            ** UNRECOGNIZED:  05 24 06 01 01
            Endpoint Descriptor:
              bLength                 7
              bDescriptorType         5
              bEndpointAddress     0x85  EP 5 IN
              bmAttributes            3
                Transfer Type            Interrupt
                Synch Type               None
                Usage Type               Data
              wMaxPacketSize     0x0010  1x 16 bytes
              bInterval               9
      
      Cc: Enrico Mioso <mrkiko.rs@gmail.com>
      Signed-off-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • sunvnet: clean up objects created in vnet_new() on vnet_exit() · 7584b928
      Sowmini Varadhan authored
      [ Upstream commit a4b70a07 ]
      
      Nothing cleans up the objects created by
      vnet_new(); they are completely leaked.
      
      vnet_exit(), after doing the vio_unregister_driver() to clean
      up ports, should call a helper function that iterates over vnet_list
      and cleans up those objects. This includes unregister_netdevice()
      as well as free_netdev().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: Karl Volz <karl.volz@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net: pppoe: use correct channel MTU when using Multilink PPP · e3e9b78e
      Christoph Schulz authored
      [ Upstream commit a8a3e41c ]
      
      The PPP channel MTU is used with Multilink PPP when ppp_mp_explode() (see
      ppp_generic module) tries to determine how big a fragment might be. According
      to RFC 1661, the MTU excludes the 2-byte PPP protocol field, see the
      corresponding comment and code in ppp_mp_explode():
      
      		/*
      		 * hdrlen includes the 2-byte PPP protocol field, but the
      		 * MTU counts only the payload excluding the protocol field.
      		 * (RFC1661 Section 2)
      		 */
      		mtu = pch->chan->mtu - (hdrlen - 2);
      
      However, the pppoe module *does* include the PPP protocol field in the channel
      MTU, which is wrong as it causes the PPP payload to be 1-2 bytes too big under
      certain circumstances (one byte if PPP protocol compression is used, two
      otherwise), causing the generated Ethernet packets to be dropped. So the pppoe
      module has to subtract two bytes from the channel MTU. This error only
      manifests itself when using Multilink PPP, as otherwise the channel MTU is not
      used anywhere.
      
      In the following, I will describe how to reproduce this bug. We configure two
      pppd instances for multilink PPP over two PPPoE links, say eth2 and eth3, with
      an MTU of 1492 bytes for each link and an MRRU of 2976 bytes. (This MRRU is
      computed by adding the two link MTUs and subtracting the MP header twice, which
      is 4 bytes long.) The necessary pppd statements on both sides are "multilink
      mtu 1492 mru 1492 mrru 2976". On the client side, we additionally need "plugin
      rp-pppoe.so eth2" and "plugin rp-pppoe.so eth3", respectively; on the server
      side, we additionally need to start two pppoe-server instances to be able to
      establish two PPPoE sessions, one over eth2 and one over eth3. We set the MTU
      of the PPP network interface to the MRRU (2976) on both sides of the connection
      in order to make use of the higher bandwidth. (If we didn't do that, IP
      fragmentation would kick in, which we want to avoid.)
      
      Now we send a ICMPv4 echo request with a payload of 2948 bytes from client to
      server over the PPP link. This results in the following network packet:
      
         2948 (echo payload)
       +    8 (ICMPv4 header)
       +   20 (IPv4 header)
      ---------------------
         2976 (PPP payload)
      
      These 2976 bytes do not exceed the MTU of the PPP network interface, so the
      IP packet is not fragmented. Now the multilink PPP code in ppp_mp_explode()
      prepends one protocol byte (0x21 for IPv4), making the packet one byte bigger
      than the negotiated MRRU. So this packet would have to be divided in three
      fragments. But this does not happen as each link MTU is assumed to be two bytes
      larger. So this packet is divided into two fragments only, one of size 1489 and
      one of size 1488. Now we have for that bigger fragment:
      
         1489 (PPP payload)
       +    4 (MP header)
       +    2 (PPP protocol field for the MP payload (0x3d))
       +    6 (PPPoE header)
      --------------------------
         1501 (Ethernet payload)
      
      This packet exceeds the link MTU and is discarded.
      
      If one configures the link MTU on the client side to 1501, one can see the
      discarded Ethernet frames with tcpdump running on the client. A
      
      ping -s 2948 -c 1 192.168.15.254
      
      leads to the smaller fragment that is correctly received on the server side:
      
      (tcpdump -vvvne -i eth3 pppoes and ppp proto 0x3d)
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x3] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [end], length 1492
      
      and to the bigger fragment that is not received on the server side:
      
      (tcpdump -vvvne -i eth2 pppoes and ppp proto 0x3d)
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1515: PPPoE  [ses 0x5] MLPPP (0x003d), length 1495: seq 0x000,
        Flags [begin], length 1493
      
      With the patch below, we correctly obtain three fragments:
      
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [begin], length 1492
      52:54:00:70:9e:89 > 52:54:00:5d:6f:b0, ethertype PPPoE S (0x8864),
        length 1514: PPPoE  [ses 0x1] MLPPP (0x003d), length 1494: seq 0x000,
        Flags [none], length 1492
      52:54:00:ad:87:fd > 52:54:00:79:5c:d0, ethertype PPPoE S (0x8864),
        length 27: PPPoE  [ses 0x1] MLPPP (0x003d), length 7: seq 0x000,
        Flags [end], length 5
      
      And the ICMPv4 echo request is successfully received at the server side:
      
      IP (tos 0x0, ttl 64, id 21925, offset 0, flags [DF], proto ICMP (1),
        length 2976)
          192.168.222.2 > 192.168.15.254: ICMP echo request, id 30530, seq 0,
            length 2956
      
      The bug was introduced in commit c9aa6895
      ("[PPPOE]: Advertise PPPoE MTU") from the very beginning. This patch applies
      to 3.10 upwards but the fix can be applied (with minor modifications) to
      kernels as old as 2.6.32.
      Signed-off-by: Christoph Schulz <develop@kristov.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • net: sctp: fix information leaks in ulpevent layer · f447e928
      Daniel Borkmann authored
      [ Upstream commit 8f2e5ae4 ]
      
      While working on some other SCTP code, I noticed that some
      structures shared with user space are leaking uninitialized
      stack or heap memory. In particular, struct sctp_sndrcvinfo
      has a 2 bytes hole between .sinfo_flags and .sinfo_ppid that
      remains unfilled by us in sctp_ulpevent_read_sndrcvinfo() when
      putting this into cmsg. But also struct sctp_remote_error
      contains a 2 bytes hole that we don't fill but place into a skb
      through skb_copy_expand() via sctp_ulpevent_make_remote_error().
      
      Both structures are defined by the IETF in RFC6458:
      
      * Section 5.3.2. SCTP Header Information Structure:
      
        The sctp_sndrcvinfo structure is defined below:
      
        struct sctp_sndrcvinfo {
          uint16_t sinfo_stream;
          uint16_t sinfo_ssn;
          uint16_t sinfo_flags;
          <-- 2 bytes hole  -->
          uint32_t sinfo_ppid;
          uint32_t sinfo_context;
          uint32_t sinfo_timetolive;
          uint32_t sinfo_tsn;
          uint32_t sinfo_cumtsn;
          sctp_assoc_t sinfo_assoc_id;
        };
      
      * 6.1.3. SCTP_REMOTE_ERROR:
      
        A remote peer may send an Operation Error message to its peer.
        This message indicates a variety of error conditions on an
        association. The entire ERROR chunk as it appears on the wire
        is included in an SCTP_REMOTE_ERROR event. Please refer to the
        SCTP specification [RFC4960] and any extensions for a list of
        possible error formats. An SCTP error notification has the
        following format:
      
        struct sctp_remote_error {
          uint16_t sre_type;
          uint16_t sre_flags;
          uint32_t sre_length;
          uint16_t sre_error;
          <-- 2 bytes hole  -->
          sctp_assoc_t sre_assoc_id;
          uint8_t  sre_data[];
        };
      
      Fix this by setting both to 0 before filling them out. We also
      have other structures shared between user and kernel space in
      SCTP that contain holes (e.g. struct sctp_paddrthlds), but we
      copy that buffer over from user space first and thus don't need
      to care about it in those cases.
      
      While at it, we can also remove lengthy comments copied from
      the draft, instead, we update the comment with the correct RFC
      number where one can look it up.
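
      A compact sketch of the shape of the fix (the field names and layout
      below are illustrative, following the RFC 6458 excerpt above, not
      the kernel structs): zero the whole structure first so padding holes
      cannot carry old stack or heap contents to user space.

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      struct example_sndrcvinfo {
              uint16_t sinfo_stream;
              uint16_t sinfo_ssn;
              uint16_t sinfo_flags;
              /* a 2-byte padding hole typically follows here */
              uint32_t sinfo_ppid;
      };

      int main(void)
      {
              struct example_sndrcvinfo info;

              memset(&info, 0, sizeof(info));  /* clears the padding hole too */
              info.sinfo_stream = 1;
              info.sinfo_ppid = 42;

              printf("sizeof=%zu\n", sizeof(info));
              return 0;
      }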
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tipc: clear 'next'-pointer of message fragments before reassembly · 775e2b35
      Jon Paul Maloy authored
      [ Upstream commit 99941754 ]
      
      If the 'next' pointer of the last fragment buffer in a message is not
      zeroed before reassembly, we risk ending up with a corrupt message,
      since the reassembly function itself isn't doing this.
      
      Currently, when a buffer is retrieved from the deferred queue of the
      broadcast link, the next pointer is not cleared, with the result as
      described above.
      
      This commit corrects this, and thereby fixes a bug that may occur when
      long broadcast messages are transmitted across dual interfaces. The bug
      has been present since 40ba3cdf ("tipc:
      message reassembly using fragment chain")
      
      This commit should be applied to both net and net-next.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • be2net: set EQ DB clear-intr bit in be_open() · 8894879e
      Suresh Reddy authored
      [ Upstream commit 4cad9f3b ]
      
      On BE3, if the clear-interrupt bit of the EQ doorbell is not set the first
      time it is armed, we have occasionally observed that the EQ doesn't raise
      any more interrupts even if it is in the armed state.
      This patch fixes this by setting the clear-interrupt bit when EQs are
      armed for the first time in be_open().
      Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
      Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • netlink: Fix handling of error from netlink_dump(). · 8a430b14
      Ben Pfaff authored
      [ Upstream commit ac30ef83 ]
      
      netlink_dump() returns a negative errno value on error.  Until now,
      netlink_recvmsg() directly recorded that negative value in sk->sk_err, but
      that's wrong since sk_err takes positive errno values.  (This manifests as
      userspace receiving a positive return value from the recv() system call,
      falsely indicating success.) This bug was introduced in the commit that
      started checking the netlink_dump() return value, commit b44d211e (netlink:
      handle errors from netlink_dump()).
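
      A tiny sketch of the sign convention (illustrative helper names, not
      the netlink code): internal functions return negative errno values,
      but sk_err, and hence what recv() reports, must hold the positive
      errno, so the sign is flipped when recording the failure.

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      static int dump_example(void)
      {
              return -ENOBUFS;         /* kernel-internal: negative errno */
      }

      int main(void)
      {
              int err = dump_example();
              int sk_err = 0;

              if (err < 0)
                      sk_err = -err;   /* store the positive errno value */

              printf("sk_err=%d (%s)\n", sk_err, strerror(sk_err));
              return 0;
      }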
      
      Multithreaded Netlink dumps are one way to trigger this behavior in
      practice, as described in the commit message for the userspace workaround
      posted here:
          http://openvswitch.org/pipermail/dev/2014-June/042339.html
      
      This commit also fixes the same bug in netlink_poll(), introduced in commit
      cd1df525 (netlink: add flow control for memory mapped I/O).
      Signed-off-by: Ben Pfaff <blp@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • appletalk: Fix socket referencing in skb · 1c385658
      Andrey Utkin authored
      [ Upstream commit 36beddc2 ]
      
      Setting just skb->sk without taking its reference and setting a
      destructor is invalid. However, in the places where this was done, skb
      is used in a way that does not require skb->sk to be set, so the setting
      of skb->sk is dropped.
      Thanks to Eric Dumazet <eric.dumazet@gmail.com> for the correct solution.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=79441
      Reported-by: Ed Martin <edman007@edman007.com>
      Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • tcp: fix false undo corner cases · 1819d324
      Yuchung Cheng authored
      [ Upstream commit 6e08d5e3 ]
      
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmits something
      2) the retransmission never fails locally (e.g., qdisc drop)
      
      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.
      
      When the assumption is broken because TCP's cwnd is too small to
      retransmit or the retransmit fails locally, the next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may re-enter the recovery state. The sender repeatedly enters and
      (incorrectly) exits recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • igmp: fix the problem when mc leave group · b0b31561
      dingtianhong authored
      [ Upstream commit 52ad353a ]
      
      The problem was triggered by these steps:
      
      1) create socket, bind and then setsockopt for add mc group.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("192.168.1.2");
         setsockopt(sockfd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
      
      2) drop the mc group for this socket.
         mreq.imr_multiaddr.s_addr = inet_addr("255.0.0.37");
         mreq.imr_interface.s_addr = inet_addr("0.0.0.0");
         setsockopt(sockfd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
      
      3) and then drop the socket, I found the mc group was still used by the dev:
      
         netstat -g
      
         Interface       RefCnt Group
         --------------- ------ ---------------------
         eth2		   1	  255.0.0.37
      
      Normally, even though IP_DROP_MEMBERSHIP returns an error, the mc group still needs
      to be released from the netdev when the socket is dropped, but this process was broken
      when the default route is NULL. The reason is that:
      
      ip_mc_leave_group() chooses the in_dev from imr_interface.s_addr; if the input addr
      is NULL, the default route dev is chosen, the ifindex is taken from that dev, and
      then inet->mc_list is walked and -ENODEV returned. But if the default route dev is
      NULL, the in_dev and ifindex are both NULL; when walking inet->mc_list, the mc group
      is removed from the mc_list, but the dev never drops its refcnt for this mc group,
      so when the socket is dropped, the mc_list is NULL and the dev still keeps this group.
      
      v1->v2: According to Hideaki's suggestion, we should align with IPv6 (RFC 3493) and the BSDs,
              so I added a check for the in_dev before walking the mc_list, making sure that when
              we remove the mc group, we decrement the refcnt on the real dev that was using the
              mc address. The problem can then no longer happen.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>