1. 01 Oct, 2015 40 commits
    • Daniel Borkmann's avatar
      netlink, mmap: transform mmap skb into full skb on taps · 85aec632
      Daniel Borkmann authored
      [ Upstream commit 1853c949 ]
      
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85aec632
    • Richard Laing's avatar
      net/ipv6: Correct PIM6 mrt_lock handling · c9cc1b7e
      Richard Laing authored
      [ Upstream commit 25b4a44c ]
      
      In the IPv6 multicast routing code the mrt_lock was not being released
      correctly in the MFC iterator, as a result adding or deleting a MIF would
      cause a hang because the mrt_lock could not be acquired.
      
      This fix is a copy of the code for the IPv4 case and ensures that the lock
      is released correctly.
      Signed-off-by: default avatarRichard Laing <richard.laing@alliedtelesis.co.nz>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9cc1b7e
    • Daniel Borkmann's avatar
      ipv6: fix exthdrs offload registration in out_rt path · c3ab9590
      Daniel Borkmann authored
      [ Upstream commit e41b0bed ]
      
      We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
      but in error path, we try to unregister it with inet_del_offload(). This
      doesn't seem correct, it should actually be inet6_del_offload(), also
      ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
      also uses rthdr_offload twice), but it got removed entirely later on.
      
      Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c3ab9590
    • Eugene Shatokhin's avatar
      usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared · 1db7ccf3
      Eugene Shatokhin authored
      [ Upstream commit f50791ac ]
      
      It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
      usbnet_stop(), but its value should be read before it is cleared
      when dev->flags is set to 0.
      
      The problem was spotted and the fix was provided by
      Oliver Neukum <oneukum@suse.de>.
      Signed-off-by: default avatarEugene Shatokhin <eugene.shatokhin@rosalab.ru>
      Acked-by: default avatarOliver Neukum <oneukum@suse.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1db7ccf3
    • huaibin Wang's avatar
      ip6_gre: release cached dst on tunnel removal · a4240b5e
      huaibin Wang authored
      [ Upstream commit d4257295 ]
      
      When a tunnel is deleted, the cached dst entry should be released.
      
      This problem may prevent the removal of a netns (seen with a x-netns IPv6
      gre tunnel):
        unregister_netdevice: waiting for lo to become free. Usage count = 3
      
      CC: Dmitry Kozlov <xeb@mail.ru>
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: default avatarhuaibin Wang <huaibin.wang@6wind.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4240b5e
    • Jack Morgenstein's avatar
      net/mlx4_core: Fix wrong index in propagating port change event to VFs · f8952cdc
      Jack Morgenstein authored
      [ Upstream commit 1c1bf349 ]
      
      The port-change event processing in procedure mlx4_eq_int() uses "slave"
      as the vf_oper array index. Since the value of "slave" is the PF function
      index, the result is that the PF link state is used for deciding to
      propagate the event for all the VFs. The VF link state should be used,
      so the VF function index should be used here.
      
      Fixes: 948e306d ('net/mlx4: Add VF link state support')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarMatan Barak <matanb@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8952cdc
    • Florian Westphal's avatar
      netlink: don't hold mutex in rcu callback when releasing mmapd ring · afe15859
      Florian Westphal authored
      [ Upstream commit 0470eb99 ]
      
      Kirill A. Shutemov says:
      
      This simple test-case trigers few locking asserts in kernel:
      
      int main(int argc, char **argv)
      {
              unsigned int block_size = 16 * 4096;
              struct nl_mmap_req req = {
                      .nm_block_size          = block_size,
                      .nm_block_nr            = 64,
                      .nm_frame_size          = 16384,
                      .nm_frame_nr            = 64 * block_size / 16384,
              };
              unsigned int ring_size;
      	int fd;
      
      	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
              if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                      exit(1);
              if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                      exit(1);
      
      	ring_size = req.nm_block_nr * req.nm_block_size;
      	mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
      	return 0;
      }
      
      +++ exited with 0 +++
      BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
      in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
      3 locks held by init/1:
       #0:  (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220
       #1:  ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70
       #2:  (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0
      Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20
      
      CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
       ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
       0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
       ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
      Call Trace:
       <IRQ>  [<ffffffff81929ceb>] dump_stack+0x4f/0x7b
       [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270
       [<ffffffff81085bed>] __might_sleep+0x4d/0x90
       [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430
       [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
       [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20
       [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350
       [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70
       [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150
       [<ffffffff817e484d>] __sk_free+0x1d/0x160
       [<ffffffff817e49a9>] sk_free+0x19/0x20
      [..]
      
      Cong Wang says:
      
      We can't hold mutex lock in a rcu callback, [..]
      
      Thomas Graf says:
      
      The socket should be dead at this point. It might be simpler to
      add a netlink_release_ring() function which doesn't require
      locking at all.
      Reported-by: default avatar"Kirill A. Shutemov" <kirill@shutemov.name>
      Diagnosed-by: default avatarCong Wang <cwang@twopensource.com>
      Suggested-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      afe15859
    • Edward Hyunkoo Jee's avatar
      inet: frags: fix defragmented packet's IP header for af_packet · ba63f0d8
      Edward Hyunkoo Jee authored
      [ Upstream commit 0848f642 ]
      
      When ip_frag_queue() computes positions, it assumes that the passed
      sk_buff does not contain L2 headers.
      
      However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
      functions can be called on outgoing packets that contain L2 headers.
      
      Also, IPv4 checksum is not corrected after reassembly.
      
      Fixes: 7736d33f ("packet: Add pre-defragmentation support for ipv4 fanouts.")
      Signed-off-by: default avatarEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba63f0d8
    • dingtianhong's avatar
      bonding: correct the MAC address for "follow" fail_over_mac policy · 2850eec9
      dingtianhong authored
      [ Upstream commit a951bc1e ]
      
      The "follow" fail_over_mac policy is useful for multiport devices that
      either become confused or incur a performance penalty when multiple
      ports are programmed with the same MAC address, but the same MAC
      address still may happened by this steps for this policy:
      
      1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
         bond0 has the same mac address with eth0, it is MAC1.
      
      2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
         eth1 is backup, eth1 has MAC2.
      
      3) ifconfig eth0 down
         eth1 became active slave, bond will swap MAC for eth0 and eth1,
         so eth1 has MAC1, and eth0 has MAC2.
      
      4) ifconfig eth1 down
         there is no active slave, and eth1 still has MAC1, eth2 has MAC2.
      
      5) ifconfig eth0 up
         the eth0 became active slave again, the bond set eth0 to MAC1.
      
      Something wrong here, then if you set eth1 up, the eth0 and eth1 will have the same
      MAC address, it will break this policy for ACTIVE_BACKUP mode.
      
      This patch will fix this problem by finding the old active slave and
      swap them MAC address before change active slave.
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      Tested-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2850eec9
    • Nikolay Aleksandrov's avatar
      bonding: fix destruction of bond with devices different from arphrd_ether · 55ae4d26
      Nikolay Aleksandrov authored
      [ Upstream commit 06f6d109 ]
      
      When the bonding is being unloaded and the netdevice notifier is
      unregistered it executes NETDEV_UNREGISTER for each device which should
      remove the bond's proc entry but if the device enslaved is not of
      ARPHRD_ETHER type and is in front of the bonding, it may execute
      bond_release_and_destroy() first which would release the last slave and
      destroy the bond device leaving the proc entry and thus we will get the
      following error (with dynamic debug on for bond_netdev_event to see the
      events order):
      [  908.963051] eql: event: 9
      [  908.963052] eql: IFF_SLAVE
      [  908.963054] eql: event: 2
      [  908.963056] eql: IFF_SLAVE
      [  908.963058] eql: event: 6
      [  908.963059] eql: IFF_SLAVE
      [  908.963110] bond0: Releasing active interface eql
      [  908.976168] bond0: Destroying bond bond0
      [  908.976266] bond0 (unregistering): Released all slaves
      [  908.984097] ------------[ cut here ]------------
      [  908.984107] WARNING: CPU: 0 PID: 1787 at fs/proc/generic.c:575
      remove_proc_entry+0x112/0x160()
      [  908.984110] remove_proc_entry: removing non-empty directory
      'net/bonding', leaking at least 'bond0'
      [  908.984111] Modules linked in: bonding(-) eql(O) 9p nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev qxl drm_kms_helper
      snd_hda_codec_generic aesni_intel ttm aes_x86_64 glue_helper pcspkr lrw
      gf128mul ablk_helper cryptd snd_hda_intel virtio_console snd_hda_codec
      psmouse serio_raw snd_hwdep snd_hda_core 9pnet_virtio 9pnet evdev joydev
      drm virtio_balloon snd_pcm snd_timer snd soundcore i2c_piix4 i2c_core
      pvpanic acpi_cpufreq parport_pc parport processor thermal_sys button
      autofs4 ext4 crc16 mbcache jbd2 hid_generic usbhid hid sg sr_mod cdrom
      ata_generic virtio_blk virtio_net floppy ata_piix e1000 libata ehci_pci
      virtio_pci scsi_mod uhci_hcd ehci_hcd virtio_ring virtio usbcore
      usb_common [last unloaded: bonding]
      
      [  908.984168] CPU: 0 PID: 1787 Comm: rmmod Tainted: G        W  O
      4.2.0-rc2+ #8
      [  908.984170] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  908.984172]  0000000000000000 ffffffff81732d41 ffffffff81525b34
      ffff8800358dfda8
      [  908.984175]  ffffffff8106c521 ffff88003595af78 ffff88003595af40
      ffff88003e3a4280
      [  908.984178]  ffffffffa058d040 0000000000000000 ffffffff8106c59a
      ffffffff8172ebd0
      [  908.984181] Call Trace:
      [  908.984188]  [<ffffffff81525b34>] ? dump_stack+0x40/0x50
      [  908.984193]  [<ffffffff8106c521>] ? warn_slowpath_common+0x81/0xb0
      [  908.984196]  [<ffffffff8106c59a>] ? warn_slowpath_fmt+0x4a/0x50
      [  908.984199]  [<ffffffff81218352>] ? remove_proc_entry+0x112/0x160
      [  908.984205]  [<ffffffffa05850e6>] ? bond_destroy_proc_dir+0x26/0x30
      [bonding]
      [  908.984208]  [<ffffffffa057540e>] ? bond_net_exit+0x8e/0xa0 [bonding]
      [  908.984217]  [<ffffffff8142f407>] ? ops_exit_list.isra.4+0x37/0x70
      [  908.984225]  [<ffffffff8142f52d>] ?
      unregister_pernet_operations+0x8d/0xd0
      [  908.984228]  [<ffffffff8142f58d>] ?
      unregister_pernet_subsys+0x1d/0x30
      [  908.984232]  [<ffffffffa0585269>] ? bonding_exit+0x23/0xdba [bonding]
      [  908.984236]  [<ffffffff810e28ba>] ? SyS_delete_module+0x18a/0x250
      [  908.984241]  [<ffffffff81086f99>] ? task_work_run+0x89/0xc0
      [  908.984244]  [<ffffffff8152b732>] ?
      entry_SYSCALL_64_fastpath+0x16/0x75
      [  908.984247] ---[ end trace 7c006ed4abbef24b ]---
      
      Thus remove the proc entry manually if bond_release_and_destroy() is
      used. Because of the checks in bond_remove_proc_entry() it's not a
      problem for a bond device to change namespaces (the bug fixed by the
      Fixes commit) but since commit
      f9399814 ("bonding: Don't allow bond devices to change network
      namespaces.") that can't happen anyway.
      Reported-by: default avatarCarol Soto <clsoto@linux.vnet.ibm.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: a64d49c3 ("bonding: Manage /proc/net/bonding/ entries from
                            the netdev events")
      Tested-by: default avatarCarol L Soto <clsoto@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      55ae4d26
    • Eric Dumazet's avatar
      ipv6: lock socket in ip6_datagram_connect() · bc31dfc8
      Eric Dumazet authored
      [ Upstream commit 03645a11 ]
      
      ip6_datagram_connect() is doing a lot of socket changes without
      socket being locked.
      
      This looks wrong, at least for udp_lib_rehash() which could corrupt
      lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc31dfc8
    • Tilman Schmidt's avatar
      isdn/gigaset: reset tty->receive_room when attaching ser_gigaset · 006dc8bb
      Tilman Schmidt authored
      [ Upstream commit fd98e941 ]
      
      Commit 79901317 ("n_tty: Don't flush buffer when closing ldisc"),
      first merged in kernel release 3.10, caused the following regression
      in the Gigaset M101 driver:
      
      Before that commit, when closing the N_TTY line discipline in
      preparation to switching to N_GIGASET_M101, receive_room would be
      reset to a non-zero value by the call to n_tty_flush_buffer() in
      n_tty's close method. With the removal of that call, receive_room
      might be left at zero, blocking data reception on the serial line.
      
      The present patch fixes that regression by setting receive_room
      to an appropriate value in the ldisc open method.
      
      Fixes: 79901317 ("n_tty: Don't flush buffer when closing ldisc")
      Signed-off-by: default avatarTilman Schmidt <tilman@imap.cc>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      006dc8bb
    • Nikolay Aleksandrov's avatar
      bridge: mdb: fix double add notification · cf4a8f24
      Nikolay Aleksandrov authored
      [ Upstream commit 5ebc7846 ]
      
      Since the mdb add/del code was introduced there have been 2 br_mdb_notify
      calls when doing br_mdb_add() resulting in 2 notifications on each add.
      
      Example:
       Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
       Before patch:
       root@debian:~# bridge monitor all
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
      
       After patch:
       root@debian:~# bridge monitor all
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: cfd56754 ("bridge: add support of adding and deleting mdb entries")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf4a8f24
    • Herbert Xu's avatar
      net: Fix skb_set_peeked use-after-free bug · 414913c2
      Herbert Xu authored
      [ Upstream commit a0a2a660 ]
      
      The commit 738ac1eb ("net: Clone
      skb before setting peeked flag") introduced a use-after-free bug
      in skb_recv_datagram.  This is because skb_set_peeked may create
      a new skb and free the existing one.  As it stands the caller will
      continue to use the old freed skb.
      
      This patch fixes it by making skb_set_peeked return the new skb
      (or the old one if unchanged).
      
      Fixes: 738ac1eb ("net: Clone skb before setting peeked flag")
      Reported-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Reviewed-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      414913c2
    • Herbert Xu's avatar
      net: Fix skb csum races when peeking · d8f7f61b
      Herbert Xu authored
      [ Upstream commit 89c22d8c ]
      
      When we calculate the checksum on the recv path, we store the
      result in the skb as an optimisation in case we need the checksum
      again down the line.
      
      This is in fact bogus for the MSG_PEEK case as this is done without
      any locking.  So multiple threads can peek and then store the result
      to the same skb, potentially resulting in bogus skb states.
      
      This patch fixes this by only storing the result if the skb is not
      shared.  This preserves the optimisations for the few cases where
      it can be done safely due to locking or other reasons, e.g., SIOCINQ.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8f7f61b
    • Herbert Xu's avatar
      net: Clone skb before setting peeked flag · 2482787e
      Herbert Xu authored
      [ Upstream commit 738ac1eb ]
      
      Shared skbs must not be modified and this is crucial for broadcast
      and/or multicast paths where we use it as an optimisation to avoid
      unnecessary cloning.
      
      The function skb_recv_datagram breaks this rule by setting peeked
      without cloning the skb first.  This causes funky races which leads
      to double-free.
      
      This patch fixes this by cloning the skb and replacing the skb
      in the list when setting skb->peeked.
      
      Fixes: a59322be ("[UDP]: Only increment counter on first peek/recv")
      Reported-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2482787e
    • Julian Anastasov's avatar
      net: call rcu_read_lock early in process_backlog · 69f02461
      Julian Anastasov authored
      [ Upstream commit 2c17d27c ]
      
      Incoming packet should be either in backlog queue or
      in RCU read-side section. Otherwise, the final sequence of
      flush_backlog() and synchronize_net() may miss packets
      that can run without device reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69f02461
    • Julian Anastasov's avatar
      net: do not process device backlog during unregistration · 2095ab24
      Julian Anastasov authored
      [ Upstream commit e9e4dd32 ]
      
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where processed packet comes from device
      with destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). Above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: backlog can keep
      packets for long time and they do not hold reference to
      device. Such packets are then delivered to upper levels
      at the same time when device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts all packets from backlog but before that some packets
      continue to be delivered to upper levels long after the
      synchronize_net call which is supposed to wait the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use netif_running check to make sure we do not add more
      packets to backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: default avatarVittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2095ab24
    • Oleg Nesterov's avatar
      net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() · 45b5e68e
      Oleg Nesterov authored
      [ Upstream commit fecdf8be ]
      
      pktgen_thread_worker() is obviously racy, kthread_stop() can come
      between the kthread_should_stop() check and set_current_state().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Reported-by: default avatarMarcelo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45b5e68e
    • Nikolay Aleksandrov's avatar
      bridge: mdb: zero out the local br_ip variable before use · d46c3f82
      Nikolay Aleksandrov authored
      [ Upstream commit f1158b74 ]
      
      Since commit b0e9a30d ("bridge: Add vlan id to multicast groups")
      there's a check in br_ip_equal() for a matching vlan id, but the mdb
      functions were not modified to use (or at least zero it) so when an
      entry was added it would have a garbage vlan id (from the local br_ip
      variable in __br_mdb_add/del) and this would prevent it from being
      matched and also deleted. So zero out the whole local ip var to protect
      ourselves from future changes and also to fix the current bug, since
      there's no vlan id support in the mdb uapi - use always vlan id 0.
      Example before patch:
      root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
      RTNETLINK answers: Invalid argument
      
      After patch:
      root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      Signed-off-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Fixes: b0e9a30d ("bridge: Add vlan id to multicast groups")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d46c3f82
    • Stephen Smalley's avatar
      net/tipc: initialize security state for new connection socket · 4db9a972
      Stephen Smalley authored
      [ Upstream commit fdd75ea8 ]
      
      Calling connect() with an AF_TIPC socket would trigger a series
      of error messages from SELinux along the lines of:
      SELinux: Invalid class 0
      type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
        for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
        tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
        permissive=0
      
      This was due to a failure to initialize the security state of the new
      connection sock by the tipc code, leaving it with junk in the security
      class field and an unlabeled secid.  Add a call to security_sk_clone()
      to inherit the security state from the parent socket.
      Reported-by: default avatarTim Shearer <tim.shearer@overturenetworks.com>
      Signed-off-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4db9a972
    • Timo Teräs's avatar
      ip_tunnel: fix ipv4 pmtu check to honor inner ip header df · 5a2c3656
      Timo Teräs authored
      [ Upstream commit fc24f2b2 ]
      
      Frag needed should be sent only if the inner header asked
      to not fragment. Currently fragmentation is broken if the
      tunnel has df set, but df was not asked in the original
      packet. The tunnel's df needs to be still checked to update
      internally the pmtu cache.
      
      Commit 23a3647b broke it, and this commit fixes
      the ipv4 df check back to the way it was.
      
      Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarTimo Teräs <timo.teras@iki.fi>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a2c3656
    • Daniel Borkmann's avatar
      rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · c8fa57bb
      Daniel Borkmann authored
      [ Upstream commit 4f7d2cdf ]
      
      Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
      SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
      anymore with respect to their policy, that is, ifla_vfinfo_policy[].
      
      Before, they were part of ifla_policy[], but they have been nested since
      placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
      which is another nested attribute for the actual VF attributes such as
      IFLA_VF_MAC, IFLA_VF_VLAN, etc.
      
      Despite the policy being split out from ifla_policy[] in this commit,
      it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
      testing for struct nlattr, but it doesn't know about the data context and
      their requirements.
      
      Fix, on top of Jason's initial work, does 1) parsing of the attributes
      with the right policy, and 2) using the resulting parsed attribute table
      from 1) instead of the nla_for_each_nested() loop (just like we used to
      do when still part of ifla_policy[]).
      
      Reference: http://thread.gmane.org/gmane.linux.network/368913
      Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
      Reported-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Rony Efraim <ronye@mellanox.com>
      Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarVlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8fa57bb
    • Eric Dumazet's avatar
      net: graceful exit from netif_alloc_netdev_queues() · 6133a5c1
      Eric Dumazet authored
      [ Upstream commit d339727c ]
      
      User space can crash kernel with
      
      ip link add ifb10 numtxqueues 100000 type ifb
      
      We must replace a BUG_ON() by proper test and return -EINVAL for
      crazy values.
      
      Fixes: 60877a32 ("net: allow large number of tx queues")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6133a5c1
    • Angga's avatar
      ipv6: Make MLD packets to only be processed locally · 1742bfe8
      Angga authored
      [ Upstream commit 4c938d22 ]
      
      Before commit daad1512 ("ipv6: Make ipv6_is_mld() inline and use it
      from ip6_mc_input().") MLD packets were only processed locally. After the
      change, a copy of MLD packet goes through ip6_mr_input, causing
      MRT6MSG_NOCACHE message to be generated to user space.
      
      Make MLD packet only processed locally.
      
      Fixes: daad1512 ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().")
      Signed-off-by: default avatarHermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1742bfe8
    • Hin-Tak Leung's avatar
      hfs,hfsplus: cache pages correctly between bnode_create and bnode_free · c824dd9a
      Hin-Tak Leung authored
      commit 7cb74be6 upstream.
      
      Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
      hfs_bnode_find() for finding or creating pages corresponding to an inode)
      are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
      and should not be page_cache_release()'ed until hfs_bnode_free().
      
      This patch fixes a problem I first saw in July 2012: merely running "du"
      on a large hfsplus-mounted directory a few times on a reasonably loaded
      system would get the hfsplus driver all confused and complaining about
      B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
      recently, I can generate this problem on up-to-date Fedora 22 with shipped
      kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
      mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
      lightly-used QEMU VM image of the full Mac OS X 10.9:
      
      $ df -i / /home /mnt
      Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
      /dev/mapper/fedora-root    3276800  551665    2725135   17% /
      /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
      /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt
      
      After applying the patch, I was able to run "du /" (60+ times) and "du
      /mnt" (150+ times) continuously and simultaneously for 6+ hours.
      
      There are many reports of the hfsplus driver getting confused under load
      and generating "BUG: Bad page state" or other similar issues over the
      years.  [1]
      
      The unpatched code [2] has always been wrong since it entered the kernel
      tree.  The only reason why it gets away with it is that the
      kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
      the kernel has not had a chance to reuse the memory for something else,
      most of the time.
      
      The current RW driver appears to have followed the design and development
      of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
      2001) had a B-tree node-centric approach to
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
      migrating towards version 0.2 (June 2002) of caching and releasing pages
      per inode extents.  When the current RW code first entered the kernel [2]
      in 2005, there was an REF_PAGES conditional (and "//" commented out code)
      to switch between B-node centric paging to inode-centric paging.  There
      was a mistake with the direction of one of the REF_PAGES conditionals in
      __hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
      removed, but a page_cache_release() was mistakenly left in (propagating
      the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
      page_cache_release() in bnode_release() (which should be spanned by
      !REF_PAGES) was never enabled.
      
      References:
      [1]:
      Michael Fox, Apr 2013
      http://www.spinics.net/lists/linux-fsdevel/msg63807.html
      ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
      
      Sasha Levin, Feb 2015
      http://lkml.org/lkml/2015/2/20/85 ("use after free")
      
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
      https://bugzilla.kernel.org/show_bug.cgi?id=42342
      https://bugzilla.kernel.org/show_bug.cgi?id=63841
      https://bugzilla.kernel.org/show_bug.cgi?id=78761
      
      [2]:
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfs/bnode.c?id=d1081202
      commit d1081202
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:36 2004 -0800
      
          [PATCH] HFS rewrite
      
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfsplus/bnode.c?id=91556682
      
      commit 91556682
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:48 2004 -0800
      
          [PATCH] HFS+ support
      
      [3]:
      http://sourceforge.net/projects/linux-hfsplus/
      
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
      
      http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
      fs/hfsplus/bnode.c?r1=1.4&r2=1.5
      
      Date:   Thu Jun 6 09:45:14 2002 +0000
      Use buffer cache instead of page cache in bnode.c. Cache inode extents.
      
      [4]:
      http://git.kernel.org/cgit/linux/kernel/git/\
      stable/linux-stable.git/commit/?id=a5e3985f
      
      commit a5e3985f
      Author: Roman Zippel <zippel@linux-m68k.org>
      Date:   Tue Sep 6 15:18:47 2005 -0700
      
      [PATCH] hfs: remove debug code
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: default avatarSergei Antonov <saproj@gmail.com>
      Reviewed-by: default avatarAnton Altaparmakov <anton@tuxera.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Sougata Santra <sougata@tuxera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c824dd9a
    • Alexey Brodkin's avatar
      stmmac: troubleshoot unexpected bits in des0 & des1 · bf6391e2
      Alexey Brodkin authored
      commit f1590670 upstream.
      
      Current implementation of descriptor init procedure only takes
      care about setting/clearing ownership flag in "des0"/"des1"
      fields while it is perfectly possible to get unexpected bits
      set because of the following factors:
      
       [1] On driver probe underlying memory allocated with
           dma_alloc_coherent() might not be zeroed and so
           it will be filled with garbage.
      
       [2] During driver operation some bits could be set by SD/MMC
           controller (for example error flags etc).
      
      And unexpected and/or randomly set flags in "des0"/"des1"
      fields may lead to unpredictable behavior of GMAC DMA block.
      
      This change addresses both items above with:
      
       [1] Use of dma_zalloc_coherent() instead of simple
           dma_alloc_coherent() to make sure allocated memory is
           zeroed. That shouldn't affect performance because
           this allocation only happens once on driver probe.
      
       [2] Do explicit zeroing of both "des0" and "des1" fields
           of all buffer descriptors during initialization of
           DMA transfer.
      
      And while at it fixed identation of dma_free_coherent()
      counterpart as well.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: arc-linux-dev@synopsys.com
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf6391e2
    • Noa Osherovich's avatar
      IB/mlx4: Use correct SL on AH query under RoCE · 86f1e069
      Noa Osherovich authored
      commit 5e99b139 upstream.
      
      The mlx4 IB driver implementation for ib_query_ah used a wrong offset
      (28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
      
      Fixes: fa417f7b ('IB/mlx4: Add support for IBoE')
      Signed-off-by: default avatarShani Michaeli <shanim@mellanox.com>
      Signed-off-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86f1e069
    • Jack Morgenstein's avatar
      IB/mlx4: Forbid using sysfs to change RoCE pkeys · 3fb968bf
      Jack Morgenstein authored
      commit 2b135db3 upstream.
      
      The pkey mapping for RoCE must remain the default mapping:
      VFs:
        virtual index 0 = mapped to real index 0 (0xFFFF)
        All others indices: mapped to a real pkey index containing an
                            invalid pkey.
      PF:
        virtual index i = real index i.
      
      Don't allow users to change these mappings using files found in
      sysfs.
      
      Fixes: c1e7e466 ('IB/mlx4: Add iov directory in sysfs under the ib device')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fb968bf
    • Yishai Hadas's avatar
      IB/uverbs: Fix race between ib_uverbs_open and remove_one · 20dbae92
      Yishai Hadas authored
      commit 35d4a0b6 upstream.
      
      Fixes: 2a72f212 ("IB/uverbs: Remove dev_table")
      
      Before this commit there was a device look-up table that was protected
      by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When
      it was dropped and container_of was used instead, it enabled the race
      with remove_one as dev might be freed just after:
      dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but
      before the kref_get.
      
      In addition, this buggy patch added some dead code as
      container_of(x,y,z) can never be NULL and so dev can never be NULL.
      As a result the comment above ib_uverbs_open saying "the open method
      will either immediately run -ENXIO" is wrong as it can never happen.
      
      The solution follows Jason Gunthorpe suggestion from below URL:
      https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html
      
      cdev will hold a kref on the parent (the containing structure,
      ib_uverbs_device) and only when that kref is released it is
      guaranteed that open will never be called again.
      
      In addition, fixes the active count scheme to use an atomic
      not a kref to prevent WARN_ON as pointed by above comment
      from Jason.
      Signed-off-by: default avatarYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: default avatarShachar Raindel <raindel@mellanox.com>
      Reviewed-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      20dbae92
    • Christoph Hellwig's avatar
      IB/uverbs: reject invalid or unknown opcodes · 58922dda
      Christoph Hellwig authored
      commit b632ffa7 upstream.
      
      We have many WR opcodes that are only supported in kernel space
      and/or require optional information to be copied into the WR
      structure.  Reject all those not explicitly handled so that we
      can't pass invalid information to drivers.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Reviewed-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58922dda
    • Mike Marciniszyn's avatar
      IB/qib: Change lkey table allocation to support more MRs · 25a58b07
      Mike Marciniszyn authored
      commit d6f1c17e upstream.
      
      The lkey table is allocated with with a get_user_pages() with an
      order based on a number of index bits from a module parameter.
      
      The underlying kernel code cannot allocate that many contiguous pages.
      
      There is no reason the underlying memory needs to be physically
      contiguous.
      
      This patch:
      - switches the allocation/deallocation to vmalloc/vfree
      - caps the number of bits to 23 to insure at least 1 generation bit
        o this matches the module parameter description
      Reviewed-by: default avatarVinit Agnihotri <vinit.abhay.agnihotri@intel.com>
      Signed-off-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25a58b07
    • Hin-Tak Leung's avatar
      hfs: fix B-tree corruption after insertion at position 0 · 18fd1e93
      Hin-Tak Leung authored
      commit b4cc0efe upstream.
      
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().
      
      This is an identical change to the corresponding hfs b-tree code to Sergei
      Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
      to keep similar code paths in the hfs and hfsplus drivers in sync, where
      appropriate.
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Sergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18fd1e93
    • David Vrabel's avatar
      xen/gntdev: convert priv->lock to a mutex · e254330e
      David Vrabel authored
      commit 1401c00e upstream.
      
      Unmapping may require sleeping and we unmap while holding priv->lock, so
      convert it to a mutex.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reviewed-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e254330e
    • NeilBrown's avatar
      md/raid10: always set reshape_safe when initializing reshape_position. · a78b8633
      NeilBrown authored
      commit 299b0685 upstream.
      
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a78b8633
    • Jialing Fu's avatar
      mmc: core: fix race condition in mmc_wait_data_done · a30c50d2
      Jialing Fu authored
      commit 71f8a4b8 upstream.
      
      The following panic is captured in ker3.14, but the issue still exists
      in latest kernel.
      ---------------------------------------------------------------------
      [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer dereference
      at virtual address 00000578
      ......
      [   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
      [   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
      [   20.740134] c0 3136 (Compiler) Call trace:
      [   20.740165] c0 3136 (Compiler) [<ffffffc0008ee900>] _raw_spin_lock_irqsave+0x24/0x60
      [   20.740200] c0 3136 (Compiler) [<ffffffc0000dd024>] __wake_up+0x1c/0x54
      [   20.740230] c0 3136 (Compiler) [<ffffffc000639414>] mmc_wait_data_done+0x28/0x34
      [   20.740262] c0 3136 (Compiler) [<ffffffc0006391a0>] mmc_request_done+0xa4/0x220
      [   20.740314] c0 3136 (Compiler) [<ffffffc000656894>] sdhci_tasklet_finish+0xac/0x264
      [   20.740352] c0 3136 (Compiler) [<ffffffc0000a2b58>] tasklet_action+0xa0/0x158
      [   20.740382] c0 3136 (Compiler) [<ffffffc0000a2078>] __do_softirq+0x10c/0x2e4
      [   20.740411] c0 3136 (Compiler) [<ffffffc0000a24bc>] irq_exit+0x8c/0xc0
      [   20.740439] c0 3136 (Compiler) [<ffffffc00008489c>] handle_IRQ+0x48/0xac
      [   20.740469] c0 3136 (Compiler) [<ffffffc000081428>] gic_handle_irq+0x38/0x7c
      ----------------------------------------------------------------------
      Because in SMP, "mrq" has race condition between below two paths:
      path1: CPU0: <tasklet context>
        static void mmc_wait_data_done(struct mmc_request *mrq)
        {
           mrq->host->context_info.is_done_rcv = true;
           //
           // If CPU0 has just finished "is_done_rcv = true" in path1, and at
           // this moment, IRQ or ICache line missing happens in CPU0.
           // What happens in CPU1 (path2)?
           //
           // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
           // path2 would have chance to break from wait_event_interruptible
           // in mmc_wait_for_data_req_done and continue to run for next
           // mmc_request (mmc_blk_rw_rq_prep).
           //
           // Within mmc_blk_rq_prep, mrq is cleared to 0.
           // If below line still gets host from "mrq" as the result of
           // compiler, the panic happens as we traced.
           wake_up_interruptible(&mrq->host->context_info.wait);
        }
      
      path2: CPU1: <The mmcqd thread runs mmc_queue_thread>
        static int mmc_wait_for_data_req_done(...
        {
           ...
           while (1) {
                 wait_event_interruptible(context_info->wait,
                         (context_info->is_done_rcv ||
                          context_info->is_new_req));
           	   static void mmc_blk_rw_rq_prep(...
                 {
                 ...
                 memset(brq, 0, sizeof(struct mmc_blk_request));
      
      This issue happens very coincidentally; however adding mdelay(1) in
      mmc_wait_data_done as below could duplicate it easily.
      
         static void mmc_wait_data_done(struct mmc_request *mrq)
         {
           mrq->host->context_info.is_done_rcv = true;
      +    mdelay(1);
           wake_up_interruptible(&mrq->host->context_info.wait);
          }
      
      At runtime, IRQ or ICache line missing may just happen at the same place
      of the mdelay(1).
      
      This patch gets the mmc_context_info at the beginning of function, it can
      avoid this race condition.
      Signed-off-by: default avatarJialing Fu <jlfu@marvell.com>
      Tested-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Fixes: 2220eedf ("mmc: fix async request mechanism ....")
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a30c50d2
    • Jann Horn's avatar
      fs: if a coredump already exists, unlink and recreate with O_EXCL · 9dbc5c78
      Jann Horn authored
      commit fbb18169 upstream.
      
      It was possible for an attacking user to trick root (or another user) into
      writing his coredumps into an attacker-readable, pre-existing file using
      rename() or link(), causing the disclosure of secret data from the victim
      process' virtual memory.  Depending on the configuration, it was also
      possible to trick root into overwriting system files with coredumps.  Fix
      that issue by never writing coredumps into existing files.
      
      Requirements for the attack:
       - The attack only applies if the victim's process has a nonzero
         RLIMIT_CORE and is dumpable.
       - The attacker can trick the victim into coredumping into an
         attacker-writable directory D, either because the core_pattern is
         relative and the victim's cwd is attacker-writable or because an
         absolute core_pattern pointing to a world-writable directory is used.
       - The attacker has one of these:
        A: on a system with protected_hardlinks=0:
           execute access to a folder containing a victim-owned,
           attacker-readable file on the same partition as D, and the
           victim-owned file will be deleted before the main part of the attack
           takes place. (In practice, there are lots of files that fulfill
           this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
           This does not apply to most Linux systems because most distros set
           protected_hardlinks=1.
        B: on a system with protected_hardlinks=1:
           execute access to a folder containing a victim-owned,
           attacker-readable and attacker-writable file on the same partition
           as D, and the victim-owned file will be deleted before the main part
           of the attack takes place.
           (This seems to be uncommon.)
        C: on any system, independent of protected_hardlinks:
           write access to a non-sticky folder containing a victim-owned,
           attacker-readable file on the same partition as D
           (This seems to be uncommon.)
      
      The basic idea is that the attacker moves the victim-owned file to where
      he expects the victim process to dump its core.  The victim process dumps
      its core into the existing file, and the attacker reads the coredump from
      it.
      
      If the attacker can't move the file because he does not have write access
      to the containing directory, he can instead link the file to a directory
      he controls, then wait for the original link to the file to be deleted
      (because the kernel checks that the link count of the corefile is 1).
      
      A less reliable variant that requires D to be non-sticky works with link()
      and does not require deletion of the original link: link() the file into
      D, but then unlink() it directly before the kernel performs the link count
      check.
      
      On systems with protected_hardlinks=0, this variant allows an attacker to
      not only gain information from coredumps, but also clobber existing,
      victim-writable files with coredumps.  (This could theoretically lead to a
      privilege escalation.)
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9dbc5c78
    • Jaewon Kim's avatar
      vmscan: fix increasing nr_isolated incurred by putback unevictable pages · 8803492d
      Jaewon Kim authored
      commit c54839a7 upstream.
      
      reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
      number of pages removed from the candidate list.  But shrink_page_list()
      puts back mlocked pages without passing it to caller and without
      counting as nr_reclaimed.  This increases nr_isolated.
      
      To fix this, this patch changes shrink_page_list() to pass unevictable
      pages back to caller.  Caller will take care those pages.
      
      Minchan said:
      
      It fixes two issues.
      
      1. With unevictable page, cma_alloc will be successful.
      
      Exactly speaking, cma_alloc of current kernel will fail due to
      unevictable pages.
      
      2. fix leaking of NR_ISOLATED counter of vmstat
      
      With it, too_many_isolated works.  Otherwise, it could make hang until
      the process get SIGKILL.
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8803492d
    • Helge Deller's avatar
      parisc: Filter out spurious interrupts in PA-RISC irq handler · 136af9b8
      Helge Deller authored
      commit b1b4e435 upstream.
      
      When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
      long way to go to find the right IRQ line, registering it, then registering the
      serial port and the irq handler for the serial port. During this phase spurious
      interrupts for the serial port may happen which then crashes the kernel because
      the action handler might not have been set up yet.
      
      So, basically it's a race condition between the serial port hardware and the
      CPU which sets up the necessary fields in the irq sructs. The main reason for
      this race is, that we unmask the serial port irqs too early without having set
      up everything properly before (which isn't easily possible because we need the
      IRQ number to register the serial ports).
      
      This patch is a work-around for this problem. It adds checks to the CPU irq
      handler to verify if the IRQ action field has been initialized already. If not,
      we just skip this interrupt (which isn't critical for a serial port at bootup).
      The real fix would probably involve rewriting all PA-RISC specific IRQ code
      (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
      the irq chips and proper irq enabling along this line.
      
      This bug has been in the PA-RISC port since the beginning, but the crashes
      happened very rarely with currently used hardware.  But on the latest machine
      which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
      1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
      crashed at every boot because of this race. So, without this patch the machine
      would currently be unuseable.
      
      For the record, here is the flow logic:
      1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
      2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
      3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
      4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
      5. serial_init_chip() then registers the 8250 port.
      Problems:
      - In step 4 the CPU irq shouldn't have been registered yet, but after step 5
      - If serial irq happens between 4 and 5 have finished, the kernel will crash
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      136af9b8
    • John David Anglin's avatar
      parisc: Use double word condition in 64bit CAS operation · 2d608a34
      John David Anglin authored
      commit 1b59ddfc upstream.
      
      The attached change fixes the condition used in the "sub" instruction.
      A double word comparison is needed.  This fixes the 64-bit LWS CAS
      operation on 64-bit kernels.
      
      I can now enable 64-bit atomic support in GCC.
      
      Signed-off-by: John David Anglin <dave.anglin>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d608a34