1. 27 Jun, 2014 32 commits
    • ocfs2: implement delayed dropping of last dquot reference · cac99bee
      Jan Kara authored
      commit e3a767b6 upstream.
      
      We cannot drop last dquot reference from downconvert thread as that
      creates the following deadlock:
      
      NODE 1                                  NODE2
      holds dentry lock for 'foo'
      holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                              dquot_initialize(bar)
                                                ocfs2_dquot_acquire()
                                                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                  ...
      downconvert thread (triggered from another
      node or a different process from NODE2)
        ocfs2_dentry_post_unlock()
          ...
          iput(foo)
            ocfs2_evict_inode(foo)
              ocfs2_clear_inode(foo)
                dquot_drop(inode)
                  ...
                  ocfs2_dquot_release()
                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                     - blocks
                                                  finds we need more space in
                                                  quota file
                                                  ...
                                                  ocfs2_extend_no_holes()
                                                    ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                      - deadlocks waiting for
                                                        downconvert thread
      
      We solve the problem by postponing dropping of the last dquot reference to
      a workqueue if it happens from the downconvert thread.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
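
A minimal sketch of the deferral idea in the ocfs2 commit above, assuming a toy reference-counted object and a single worker thread standing in for the kernel workqueue. The names `Dquot`, `dqput`, `release`, and `drop_queue` are hypothetical and only illustrate the control flow, not the actual patch.

```python
import queue
import threading


class Dquot:
    """Toy stand-in for a quota structure with a reference count."""

    def __init__(self):
        self.refs = 1
        self.released = False


def release(dquot):
    # In the real filesystem this step would need the quota-file cluster lock.
    dquot.refs -= 1
    if dquot.refs == 0:
        dquot.released = True


# Plays the role of the workqueue that receives deferred drops.
drop_queue = queue.Queue()


def worker():
    # Drain deferred drops until the stop marker (None) arrives.
    while True:
        dquot = drop_queue.get()
        if dquot is None:
            break
        release(dquot)


def dqput(dquot, in_downconvert_thread):
    if in_downconvert_thread and dquot.refs == 1:
        # Last reference from the downconvert thread: do not block here,
        # hand the drop to the worker instead.
        drop_queue.put(dquot)
    else:
        release(dquot)


def demo():
    d = Dquot()
    t = threading.Thread(target=worker)
    t.start()
    dqput(d, in_downconvert_thread=True)
    drop_queue.put(None)  # stop marker for the worker
    t.join()
    return d.released
```

The point of the design is that the thread which must never block on the quota lock only enqueues work; the lock is taken later from a context that is allowed to wait.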
    • quota: provide function to grab quota structure reference · b5258061
      Jan Kara authored
      commit 9f985cb6 upstream.
      
      Provide dqgrab() function to get quota structure reference when we are
      sure it already has at least one active reference.  Make use of this
      function inside quota code.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
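
The contract described for dqgrab() can be sketched as follows. This is an illustrative model only: the `Dquot` class and the error handling are assumptions, and the real helper operates on the kernel's quota structure rather than a Python object.

```python
class Dquot:
    """Toy quota structure; count is the number of active references."""

    def __init__(self):
        self.count = 0


def dqget(dquot):
    # General path: may hand out the very first reference.
    dquot.count += 1
    return dquot


def dqgrab(dquot):
    # Cheap path: only valid when the caller knows a reference already exists.
    if dquot.count <= 0:
        raise ValueError("dqgrab() requires an existing active reference")
    dquot.count += 1
    return dquot
```

Usage: a caller that already holds a reference from `dqget()` can call `dqgrab()` to take another one; calling it on an unreferenced structure is a bug and is reported here with an exception.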
    • ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later · 2eb0658f
      Jan Kara authored
      commit bd62ad7a upstream.
      
      Move dquot_initialize() call in ocfs2_delete_inode() after the moment we
      verify inode is actually a sane one to delete.  We certainly don't want
      to initialize quota for system inodes etc.  This also avoids calling
      into quota code from downconvert thread.
      
      Add more details into the comment why bailing out from
      ocfs2_delete_inode() when we are in downconvert thread is OK.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • dlm: keep listening connection alive with sctp mode · 34b6c049
      Lidong Zhong authored
      commit 883854c5 upstream.
      
      The connection struct with nodeid 0 is the listening socket,
      not a connection to another node.  The sctp resend function
      was not checking that the nodeid was valid (non-zero), so it
      would mistakenly get and resend on the listening connection
      when nodeid was zero.
      Signed-off-by: Lidong Zhong <lzhong@suse.com>
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
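
The nodeid check described in the dlm commit above amounts to a guard of roughly this shape. The function name and the dictionary of connections are hypothetical, chosen only to illustrate the rule that nodeid 0 is the listening socket.

```python
def connection_for_resend(connections, nodeid):
    """Return the peer connection to resend on, or None if there is none.

    nodeid 0 identifies the listening socket, not a peer, so it is never
    a valid target for a resend.
    """
    if nodeid == 0:
        return None
    return connections.get(nodeid)
```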
    • Btrfs: fix BUG_ON() casued by the reserved space migration · 9bf37c05
      Miao Xie authored
      commit 20dd2cbf upstream.
      
      When we did space balance and snapshot creation at the same time, we might
      meet the following oops:
       kernel BUG at fs/btrfs/inode.c:3038!
       [SNIP]
       Call Trace:
       [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
       [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
       [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
       [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
       [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
       [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
       [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
       [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
       [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
       [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
       [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
       [SNIP]
       RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
      
      The cause of the problem is that relocation root creation stole
      the space that was reserved for orphan item deletion.
      
      There are several ways to fix this problem. One is to increase
      the reserved space size of the space balance, and then use
      that space to create the relocation tree for each fs/file tree.
      But it is hard to calculate a suitable size because we don't
      know how many fs/file trees we need to relocate.
      
      We fixed this problem by actively reserving the space for relocation
      root creation, since the space it needs is very small (one tree block,
      used for the root node copy), and then using that reserved space to
      create the relocation tree. If we don't reserve space for relocation
      tree creation, we will use the reserved space of the balance.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: fix two use-after-free bugs with transaction cleanup · 8b16b61c
      Josef Bacik authored
      commit 724e2315 upstream.
      
      I was noticing the slab redzone stuff going off every once in a while during
      transaction aborts.  This was caused by two things:
      
      1) We would walk the pending snapshots and set their error to -ECANCELED.  We
      don't need to do this, the snapshot stuff waits for a transaction commit and if
      there is a problem we just free our pending snapshot object and exit.  Doing
      this was causing us to touch the pending snapshot object after the thing had
      already been freed.
      
      2) We were freeing the transaction manually with wanton disregard for its
      use_count reference counter.  To fix this I cleaned up the transaction freeing
      loop to either wait for the transaction commit to finish if it was in the middle
      of that (since it will be cleaned and freed up there) or to do the cleanup
      ourselves.
      
      I also moved the global "kill all things dirty everywhere" stuff outside of the
      transaction cleanup loop since that only needs to be done once.  With this patch
      I'm no longer seeing slab corruption because of use after frees.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: don't delete ordered roots from list during cleanup · 445d1c3a
      Josef Bacik authored
      commit 1de2cfde upstream.
      
      During transaction cleanup after an abort we are just removing roots from the
      ordered roots list which is incorrect.  We have a BUG_ON() to make sure that the
      root is still part of the ordered roots list when we put our ordered extent
      which we were tripping in this case.  So do like we do everywhere else and just
      move it to the tail of the ordered roots list and allow the normal cleanup to
      take care of stuff.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: cleanup transaction on abort · bd32872f
      Josef Bacik authored
      commit 4e121c06 upstream.
      
      If we abort not during a transaction commit we won't clean up anything until we
      unmount.  Unfortunately if we abort in the middle of writing out an ordered
      extent we won't clean it up and if somebody is waiting on that ordered extent
      they will wait forever.  To fix this just make the transaction kthread call the
      cleanup transaction stuff if it notices there's an error, and make
      btrfs_end_transaction wake up the transaction kthread if there is an error.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: do not release metadata for space cache inodes · 4b6d66d1
      Josef Bacik authored
      commit b6d08f06 upstream.
      
      I've been testing our error paths and I was tripping the BUG_ON() in
      drop_outstanding_extent because our outstanding_extents is 0 for space cache
      inodes.  This is because we don't reserve metadata space for these inodes since
      we depend on the global block reserve for our space.  To fix this we need to
      make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
      space cache inodes.  With this patch I'm no longer panicking.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: don't leak block group on error · 596075a2
      Filipe David Borba Manana authored
      commit e84cc142 upstream.
      
      In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to
      write_one_cache_group() failed, we would return without putting
      the block group first.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: fix sync fs to actually wait for all data to be persisted · a2ea3d78
      Filipe David Borba Manana authored
      commit 9b199859 upstream.
      
      Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't
      wait for delayed work to finish before returning success to the
      caller. This change fixes this, ensuring that there's no data loss
      if a power failure happens right after fs sync returns success to
      the caller and before the next commit happens.
      
      Steps to reproduce the data loss issue:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt/btrfs
      $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs
      
      Right after the btrfs fi sync command (a second or two later, for example),
      power off the machine and reboot it. The file will be empty, as can be
      verified after mounting the filesystem and through btrfs-debug-tree:
      
      $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
              item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
                      location key (257 INODE_ITEM 0) type FILE
                      namelen 6 datalen 0 name: foobar
              item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
                      inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1
              item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
                      inode ref index 2 namelen 6 name: foobar
      checksum tree key (CSUM_TREE ROOT_ITEM 0)
      leaf 29429760 items 0 free space 3995 generation 7 owner 7
      fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e
      chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae
      uuid tree key (UUID_TREE ROOT_ITEM 0)
      
      After this patch, the data loss no longer happens after a power failure and
      btrfs-debug-tree shows:
      
      $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
      	item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
      		location key (257 INODE_ITEM 0) type FILE
      		namelen 6 datalen 0 name: foobar
      	item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
      		inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1
      	item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
      		inode ref index 2 namelen 6 name: foobar
      	item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53
      		extent data disk byte 12845056 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      		extent compression 0
      checksum tree key (CSUM_TREE ROOT_ITEM 0)
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Btrfs: fix tracking of orphan inode count · f19eb84e
      Filipe David Borba Manana authored
      commit 703c88e0 upstream.
      
      In inode.c:btrfs_orphan_add() if we failed to insert the orphan
      item, we would return without decrementing the orphan count that
      we just incremented before attempting the insertion, leaving the
      orphan inode count wrong.
      
      In inode.c:btrfs_orphan_del(), we were decrementing the inode
      orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
      which is logically wrong because it should be decremented if the
      bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
      the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
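
The counting rule from the orphan-count commit above can be modelled as below. This is a sketch under stated assumptions: `Root`, `Inode`, `orphan_add`, `orphan_del`, and the `insert_item` callback are hypothetical stand-ins, and clearing the flag on failure is a simplification made for the model, not a claim about the patch.

```python
class Root:
    def __init__(self):
        self.orphan_inodes = 0  # number of inodes that have an orphan item


class Inode:
    def __init__(self):
        self.has_orphan_item = False  # models BTRFS_INODE_HAS_ORPHAN_ITEM


def orphan_add(root, inode, insert_item):
    """Add an orphan item; returns True on success, False on failure."""
    if inode.has_orphan_item:
        return True
    inode.has_orphan_item = True
    root.orphan_inodes += 1
    try:
        insert_item(inode)
    except OSError:
        # Undo the increment so the count stays accurate on failure.
        root.orphan_inodes -= 1
        inode.has_orphan_item = False  # simplification for this sketch
        return False
    return True


def orphan_del(root, inode):
    # Decrement under the same flag that guarded the increment.
    if inode.has_orphan_item:
        inode.has_orphan_item = False
        root.orphan_inodes -= 1
```

Pairing the decrement with the same flag that guarded the increment is what keeps the counter balanced regardless of which other flags happen to be set.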
    • Do not send ClientGUID on SMB2.02 dialect · c82b3dd9
      Steve French authored
      commit 3c5f9be1 upstream.
      
      ClientGUID must be zero for SMB2.02 dialect.  See section 2.2.3
      of MS-SMB2. For SMB2.1 and later it must be non-zero.
      Signed-off-by: Steve French <smfrench@gmail.com>
      CC: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
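
The ClientGUID rule stated above can be sketched as a small selection function. The dialect revision codes 0x0202 (SMB 2.0.2) and 0x0210 (SMB 2.1) come from MS-SMB2; the function name and the idea of passing in a per-connection GUID are assumptions made for illustration.

```python
import uuid

SMB2_02 = 0x0202  # SMB 2.0.2 dialect revision
SMB2_1 = 0x0210   # SMB 2.1 dialect revision


def client_guid_for(dialect, connection_guid):
    """Return the 16-byte ClientGuid to place in the negotiate request."""
    if dialect == SMB2_02:
        # Must be zero for the SMB2.02 dialect.
        return bytes(16)
    # SMB2.1 and later send a non-zero GUID.
    return connection_guid


# Example: a fresh random GUID for a new connection.
example_guid = uuid.uuid4().bytes
```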
    • cifs: Set client guid on per connection basis · 8f7e86ca
      Sachin Prabhu authored
      commit 39552ea8 upstream.
      
      When mounting from a Windows 2012R2 server, we hit the following
      problem:
      1) Mount with any of the following versions - 2.0, 2.1 or 3.0
      2) unmount
      3) Attempt a mount again using a different SMB version >= 2.0.
      
      You end up with the following failure:
      Status code returned 0xc0000203 STATUS_USER_SESSION_DELETED
      CIFS VFS: Send error in SessSetup = -5
      CIFS VFS: cifs_mount failed w/return code = -5
      
      I cannot reproduce this issue using a Windows 2008 R2 server.
      
      This appears to happen because we reuse the same client guid: the
      connection from the first mount is disconnected, and the next mount
      attempt uses a different protocol version with that same guid. By
      generating a new guid each time a new connection is negotiated, we
      avoid hitting this problem.
      Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Check SMB3 dialects against downgrade attacks · 16e57e55
      Steve French authored
      commit ff1c038a upstream.
      
      When we are running SMB3 or SMB3.02 connections which are signed
      we need to validate the protocol negotiation information,
      to ensure that the negotiate protocol response was not tampered with.
      
      Add the missing FSCTL which is sent at mount time (immediately after
      the SMB3 Tree Connect) to validate that the capabilities match
      what we think the server sent.
      
      "Secure dialect negotiation is introduced in SMB3 to protect against
      man-in-the-middle attempt to downgrade dialect negotiation.
      The idea is to prevent an eavesdropper from downgrading the initially
      negotiated dialect and capabilities between the client and the server."
      
      For more explanation see 2.2.31.4 of MS-SMB2 or
      http://blogs.msdn.com/b/openspecification/archive/2012/06/28/smb3-secure-dialect-negotiation.aspx
      Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
      Signed-off-by: Steve French <smfrench@gmail.com>
      [ddiss@suse.de: backported atop kernel without clone_range support]
      Signed-off-by: David Disseldorp <ddiss@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • xfrm: fix race between netns cleanup and state expire notification · 3f8fd8ad
      Michal Kubecek authored
      commit 21ee543e upstream.
      
      The xfrm_user module registers its pernet init/exit after xfrm
      itself so that its net exit function xfrm_user_net_exit() is
      executed before xfrm_net_exit() which calls xfrm_state_fini() to
      cleanup the SA's (xfrm states). This opens a window between
      zeroing net->xfrm.nlsk pointer and deleting all xfrm_state
      instances which may access it (via the timer). If an xfrm state
      expires in this window, xfrm_exp_state_notify() will pass null
      pointer as socket to nlmsg_multicast().
      
      As the notifications are called inside rcu_read_lock() block, it
      is sufficient to retrieve the nlsk socket with rcu_dereference()
      and check it for null.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • vlan: more careful checksum features handling · 306ba5b2
      Michal Kubeček authored
      commit da08143b upstream.
      
      When combining real_dev's features and vlan_features, simple
      bitwise AND is used. This doesn't work well for checksum
      offloading features as if one set has NETIF_F_HW_CSUM and the
      other NETIF_F_IP_CSUM and/or NETIF_F_IPV6_CSUM, we end up with
      no checksum offloading. However, from the logical point of view
      (how can_checksum_protocol() works), NETIF_F_HW_CSUM contains
      the functionality of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM so
      that the result should be IP/IPV6.
      
      Add helper function netdev_intersect_features() implementing
      this logic and use it in vlan_dev_fix_features().
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
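
The intersection logic described in the vlan commit above can be modelled with plain bit masks. The flag values below are illustrative only and do not match the kernel's real NETIF_F_* bit positions; the function is a sketch of the described behaviour, not a copy of the kernel helper.

```python
# Illustrative bit values; the kernel's real NETIF_F_* bits differ.
NETIF_F_IP_CSUM = 1 << 0
NETIF_F_HW_CSUM = 1 << 1
NETIF_F_IPV6_CSUM = 1 << 2
NETIF_F_SG = 1 << 3  # an unrelated feature, included for contrast


def netdev_intersect_features(f1, f2):
    """Intersect two feature sets, treating HW_CSUM as a superset of
    IP_CSUM and IPV6_CSUM."""
    implied = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
    # Expand HW_CSUM into the features it covers before intersecting.
    if f1 & NETIF_F_HW_CSUM:
        f1 |= implied
    if f2 & NETIF_F_HW_CSUM:
        f2 |= implied
    return f1 & f2
```

With one side advertising `NETIF_F_HW_CSUM` and the other `NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM`, a plain bitwise AND yields no checksum offload at all, while this helper yields the IP/IPV6 pair.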
    • net/compat: Fix minor information leak in siocdevprivate_ioctl() · e45145b6
      Ben Hutchings authored
      commit 417c3522 upstream.
      
      We don't need to check that ifr_data itself is a valid user pointer,
      but we should check &ifr_data is.  Thankfully the copy of ifr_name is
      checked, so this can only leak a few bytes from immediately above the
      user address limit.
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • net: Do not enable tx-nocache-copy by default · 9f6e089c
      Benjamin Poirier authored
      commit cdb3f4a3 upstream.
      
      There are many cases where this feature does not improve performance or even
      reduces it.
      
      For example, here are the results from tests that I've run using 3.12.6 on one
      Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
      from the Xeon, but they're similar on the i7. All numbers report the
      mean±stddev over 10 runs of 10s.
      
      1) latency tests similar to what is described in "c6e1a0d1 net: Allow no-cache
      copy from user on transmit"
      There is no statistically significant difference between tx-nocache-copy
      on/off.
      nic irqs spread out (one queue per cpu)
      
      200x netperf -r 1400,1
      tx-nocache-copy off
              692000±1000 tps
              50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
      tx-nocache-copy on
              693000±1000 tps
              50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
      
      200x netperf -r 14000,14000
      tx-nocache-copy off
              86450±80 tps
              50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
      tx-nocache-copy on
              86110±60 tps
              50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
      
      2) single stream throughput tests
      tx-nocache-copy leads to higher service demand
      
                              throughput  cpu0        cpu1        demand
                              (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)
      
      nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
      
      tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
      tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004
      
      nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
      
      tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
      tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002
      
      As a second example, here are some results from Eric Dumazet with latest
      net-next.
      tx-nocache-copy also leads to higher service demand
      
      (cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             4282.648396 task-clock                #    0.423 CPUs utilized
                   9,348 context-switches          #    0.002 M/sec
                      88 CPU-migrations            #    0.021 K/sec
                     355 page-faults               #    0.083 K/sec
          11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
           9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
           4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
           6,053,172,792 instructions              #    0.51  insns per cycle
                                                   #    1.49  stalled cycles per insn [83.64%]
             597,275,583 branches                  #  139.464 M/sec                   [83.70%]
               8,960,541 branch-misses             #    1.50% of all branches         [83.65%]
      
            10.128990264 seconds time elapsed
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             2847.375441 task-clock                #    0.281 CPUs utilized
                  11,632 context-switches          #    0.004 M/sec
                      49 CPU-migrations            #    0.017 K/sec
                     354 page-faults               #    0.124 K/sec
           7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
           6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
           1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
           2,079,702,453 instructions              #    0.27  insns per cycle
                                                   #    2.94  stalled cycles per insn [83.22%]
             363,773,213 branches                  #  127.757 M/sec                   [83.29%]
               4,242,732 branch-misses             #    1.17% of all branches         [83.51%]
      
            10.128449949 seconds time elapsed
      
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ACPI / memhotplug: add parameter to disable memory hotplug · e801ecec
      Prarit Bhargava authored
      commit 00159a20 upstream.
      
      When booting a kexec/kdump kernel on a system that has specific memory
      hotplug regions the boot will fail with warnings like:
      
       swapper/0: page allocation failure: order:9, mode:0x84d0
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-65.el7.x86_64 #1
       Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
        0000000000000000 ffff8800341bd8c8 ffffffff815bcc67 ffff8800341bd950
        ffffffff8113b1a0 ffff880036339b00 0000000000000009 00000000000084d0
        ffff8800341bd950 ffffffff815b87ee 0000000000000000 0000000000000200
       Call Trace:
        [<ffffffff815bcc67>] dump_stack+0x19/0x1b
        [<ffffffff8113b1a0>] warn_alloc_failed+0xf0/0x160
        [<ffffffff815b87ee>] ?  __alloc_pages_direct_compact+0xac/0x196
        [<ffffffff8113f14f>] __alloc_pages_nodemask+0x7ff/0xa00
        [<ffffffff815b417c>] vmemmap_alloc_block+0x62/0xba
        [<ffffffff815b41e9>] vmemmap_alloc_block_buf+0x15/0x3b
        [<ffffffff815b1ff6>] vmemmap_populate+0xb4/0x21b
        [<ffffffff815b461d>] sparse_mem_map_populate+0x27/0x35
        [<ffffffff815b400f>] sparse_add_one_section+0x7a/0x185
        [<ffffffff815a1e9f>] __add_pages+0xaf/0x240
        [<ffffffff81047359>] arch_add_memory+0x59/0xd0
        [<ffffffff815a21d9>] add_memory+0xb9/0x1b0
        [<ffffffff81333b9c>] acpi_memory_device_add+0x18d/0x26d
        [<ffffffff81309a01>] acpi_bus_device_attach+0x7d/0xcd
        [<ffffffff8132379d>] acpi_ns_walk_namespace+0xc8/0x17f
        [<ffffffff81309984>] ? acpi_bus_type_and_status+0x90/0x90
        [<ffffffff81309984>] ? acpi_bus_type_and_status+0x90/0x90
        [<ffffffff81323c8c>] acpi_walk_namespace+0x95/0xc5
        [<ffffffff8130a6d6>] acpi_bus_scan+0x8b/0x9d
        [<ffffffff81a2019a>] acpi_scan_init+0x63/0x160
        [<ffffffff81a1ffb5>] acpi_init+0x25d/0x2a6
        [<ffffffff81a1fd58>] ? acpi_sleep_proc_init+0x2a/0x2a
        [<ffffffff810020e2>] do_one_initcall+0xe2/0x190
        [<ffffffff819e20c4>] kernel_init_freeable+0x17c/0x207
        [<ffffffff819e18d0>] ? do_early_param+0x88/0x88
        [<ffffffff8159fea0>] ? rest_init+0x80/0x80
        [<ffffffff8159feae>] kernel_init+0xe/0x180
        [<ffffffff815cca2c>] ret_from_fork+0x7c/0xb0
        [<ffffffff8159fea0>] ? rest_init+0x80/0x80
       Mem-Info:
       Node 0 DMA per-cpu:
       CPU    0: hi:    0, btch:   1 usd:   0
       Node 0 DMA32 per-cpu:
       CPU    0: hi:   42, btch:   7 usd:   0
       active_anon:0 inactive_anon:0 isolated_anon:0
        active_file:0 inactive_file:0 isolated_file:0
        unevictable:0 dirty:0 writeback:0 unstable:0
        free:872 slab_reclaimable:13 slab_unreclaimable:1880
        mapped:0 shmem:0 pagetables:0 bounce:0
        free_cma:0
      
      because the system has run out of memory at boot time.  This occurs
      because of the following sequence in the boot:
      
      Main kernel boots and sets E820 map.  The second kernel is booted with a
      map generated by the kdump service using memmap= and memmap=exactmap.
      These parameters are added to the kernel parameters of the kexec/kdump
      kernel.   The kexec/kdump kernel has limited memory resources so as not
      to severely impact the main kernel.
      
      The system then panics and the kdump/kexec kernel boots (which is a
      completely new kernel boot).  During this boot ACPI is initialized and the
      kernel (as can be seen above) traverses the ACPI namespace and finds an
      entry for a memory device to be hotadded.
      
      i.e.:
      
        [<ffffffff815a1e9f>] __add_pages+0xaf/0x240
        [<ffffffff81047359>] arch_add_memory+0x59/0xd0
        [<ffffffff815a21d9>] add_memory+0xb9/0x1b0
        [<ffffffff81333b9c>] acpi_memory_device_add+0x18d/0x26d
        [<ffffffff81309a01>] acpi_bus_device_attach+0x7d/0xcd
        [<ffffffff8132379d>] acpi_ns_walk_namespace+0xc8/0x17f
        [<ffffffff81309984>] ? acpi_bus_type_and_status+0x90/0x90
        [<ffffffff81309984>] ? acpi_bus_type_and_status+0x90/0x90
        [<ffffffff81323c8c>] acpi_walk_namespace+0x95/0xc5
        [<ffffffff8130a6d6>] acpi_bus_scan+0x8b/0x9d
        [<ffffffff81a2019a>] acpi_scan_init+0x63/0x160
        [<ffffffff81a1ffb5>] acpi_init+0x25d/0x2a6
      
      At this point the kernel adds page table information and the kexec/kdump
      kernel runs out of memory.
      
      This can also be reproduced by using the memmap=exactmap and mem=X
      parameters on the main kernel and booting.
      
      This patchset resolves the problem by adding a kernel parameter,
      acpi_no_memhotplug, to disable ACPI memory hotplug.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      e801ecec
    • Peter Zijlstra's avatar
      sched: Make scale_rt_power() deal with backward clocks · 5ff029e2
      Peter Zijlstra authored
      commit cadefd3d upstream.
      
      Mike reported that, while unlikely, it's entirely possible for
      scale_rt_power() to see the time go backwards. This yields rather
      'interesting' results.
      
      So like all other sites that deal with clocks; make this one ignore
      backward clock movement too.
      Reported-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140227094035.GZ9987@twins.programming.kicks-ass.net
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      5ff029e2
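For the scale_rt_power() change above, the idea of ignoring backward clock movement can be sketched in user-space C. This is a minimal illustration, not the kernel code: `forward_delta()` and `window_total()` are hypothetical names, and the "period plus elapsed time" window is an assumption made only to show where the clamped delta would be used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): time elapsed since `stamp`.
 * A clock reading earlier than `stamp` counts as zero elapsed time instead
 * of wrapping around to a huge unsigned value.
 */
static uint64_t forward_delta(uint64_t now, uint64_t stamp)
{
    return now > stamp ? now - stamp : 0;
}

/* Hypothetical averaging window: a fixed period plus the forward delta. */
static uint64_t window_total(uint64_t period, uint64_t now, uint64_t stamp)
{
    assert(period > 0); /* a zero-length window would be meaningless */
    return period + forward_delta(now, stamp);
}
```

Without the clamp, `now - stamp` on unsigned values would yield a very large number whenever the clock appears to step backwards, which would distort anything computed from the window.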
    • Wendy Xiong's avatar
      [SCSI] ipr: Add new CCIN definition for Grand Canyon support · dcc23f13
      Wendy Xiong authored
      commit 5eeac3e9 upstream.
      
      Add the appropriate definition and table entry for new hardware support.
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Acked-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      dcc23f13
    • Mike Qiu's avatar
      powerpc/mm: fix ".__node_distance" undefined · 311222a9
      Mike Qiu authored
      commit 12c743eb upstream.
      
        CHK     include/config/kernel.release
        CHK     include/generated/uapi/linux/version.h
        CHK     include/generated/utsrelease.h
        ...
        Building modules, stage 2.
      WARNING: 1 bad relocations
      c0000000013d6a30 R_PPC64_ADDR64    uprobes_fetch_type_table
        WRAP    arch/powerpc/boot/zImage.pseries
        WRAP    arch/powerpc/boot/zImage.epapr
        MODPOST 1849 modules
      ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
      make[1]: *** [__modpost] Error 1
      make: *** [modules] Error 2
      make: *** Waiting for unfinished jobs....
      
      The reason is that the symbol "__node_distance" is not exported on powerpc.
      Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      311222a9
    • Petr Mladek's avatar
      ftrace/x86: Call text_ip_addr() instead of the duplicated code · 38a572d0
      Petr Mladek authored
      commit 964f7b6b upstream.
      
      I just went over this when looking at some Xen-related ftrace initialization
      problems. They were related to Xen code that is not upstream but this clean up
      would make sense here.
      
      I think that this was already the intention when text_ip_addr() was introduced
      in commit 87fbb2ac (ftrace/x86: Use breakpoints for converting
      function graph caller). Anyway, better to do it now before it shoots people in
      the leg ;-)
      
      Link: http://lkml.kernel.org/p/1401812601-2359-1-git-send-email-pmladek@suse.cz
      Signed-off-by: Petr Mladek <pmladek@suse.cz>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      38a572d0
    • J. Bruce Fields's avatar
      nfsd4: fix FREE_STATEID lockowner leak · 2eaaa8d2
      J. Bruce Fields authored
      commit 48385408 upstream.
      
      27b11428 ("nfsd4: remove lockowner when removing lock stateid")
      introduced a memory leak.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      2eaaa8d2
    • Ying Xue's avatar
      tipc: fix memory leak of publications · 48195684
      Ying Xue authored
      commit 1621b94d upstream.
      
      Commit 1bb8dce5 ("tipc: fix memory
      leak during module removal") introduced a memory leak: when the
      name table is stopped, the publication instances are never
      freed. Additionally, the useless "continue" statement in
      tipc_nametbl_stop() is removed.
      Reported-by: Jason <huzhijiang@gmail.com>
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Erik Hugne <erik.hugne@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      48195684
    • Jiang Liu's avatar
      intel_idle: close avn_cstates array with correct marker · 4c65d4f6
      Jiang Liu authored
      commit 88390996 upstream.
      
      Close avn_cstates array with correct marker to avoid overflow
      in function intel_idle_cpu_init().
      
      [rjw: The problem was introduced when commit 22e580d0 was merged
       on top of eba682a5 (intel_idle: shrink states tables).]
      
      Fixes: 22e580d0 (intel_idle: Fixed C6 state on Avoton/Rangeley processors)
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4c65d4f6
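The avn_cstates fix above concerns a table that is walked until an end marker is found. The sketch below shows that general pattern in user-space C. It assumes, for illustration only, that the marker is an entry whose `enter` callback is NULL; `struct demo_state`, `demo_states` and `count_states()` are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an idle-state table entry. */
struct demo_state {
    const char *name;
    int (*enter)(void); /* NULL marks the end of the table (assumed marker) */
};

static int demo_enter(void)
{
    return 0;
}

static const struct demo_state demo_states[] = {
    { "C1", demo_enter },
    { "C6", demo_enter },
    { NULL, NULL }, /* end marker; without it the walk below would overrun */
};

/* Count entries up to, but not including, the end marker. */
static size_t count_states(const struct demo_state *table)
{
    size_t n = 0;

    assert(table != NULL);
    while (table[n].enter != NULL)
        n++;
    return n;
}
```

A table closed with the wrong kind of marker looks unterminated to such a loop, so the walk continues past the end of the array.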
    • Viresh Kumar's avatar
      tick-sched: Check tick_nohz_enabled in tick_nohz_switch_to_nohz() · e54e6e8e
      Viresh Kumar authored
      commit 27630532 upstream.
      
      Since commit d689fe22 (NOHZ: Check for nohz active instead of nohz
      enabled) the tick_nohz_switch_to_nohz() function returns early because it
      checks for the tick_nohz_active flag. That flag can never be set at this
      point, because the function itself is what sets it.
      
      Undo the change in tick_nohz_switch_to_nohz().
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: linaro-kernel@lists.linaro.org
      Cc: fweisbec@gmail.com
      Cc: Arvind.Chauhan@arm.com
      Cc: linaro-networking@linaro.org
      Cc: <stable@vger.kernel.org> # 3.13+
      Link: http://lkml.kernel.org/r/40939c05f2d65d781b92b20302b02243d0654224.1397537987.git.viresh.kumar@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      e54e6e8e
    • Konstantin Khlebnikov's avatar
      epoll: fix use-after-free in eventpoll_release_file · c79460f6
      Konstantin Khlebnikov authored
      commit ebe06187 upstream.
      
      This fixes use-after-free of epi->fllink.next inside list loop macro.
      This loop actually releases elements in the body.  The list is
      rcu-protected but here we cannot hold rcu_read_lock because we need to
      lock mutex inside.
      
      The obvious solution is to use list_for_each_entry_safe().  RCU-ness
      isn't essential because nobody can change this list under us, it's final
      fput for this file.
      
      The bug was introduced by ae10b2b4 ("epoll: optimize EPOLL_CTL_DEL
      using rcu")
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Stable <stable@vger.kernel.org> # 3.13+
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c79460f6
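The epoll fix above depends on reading the next link before the current element is released. A minimal user-space sketch of that pattern follows. `struct node`, `push()` and `release_all()` are hypothetical names, and the hand-written loop only stands in for what the kernel's list_for_each_entry_safe() macro arranges; it is not the eventpoll code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical singly linked list node used only for this illustration. */
struct node {
    struct node *next;
    int value;
};

/* Push a node at the head; on allocation failure the list is unchanged. */
static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->value = value;
    n->next = head;
    return n;
}

/* Free every node and return how many were released. */
static int release_all(struct node **head)
{
    struct node *cur, *next;
    int freed = 0;

    assert(head != NULL);
    for (cur = *head; cur != NULL; cur = next) {
        next = cur->next; /* read the link before the node is freed */
        free(cur);
        freed++;
    }
    *head = NULL;
    return freed;
}
```

Writing the loop step as `cur = cur->next` after `free(cur)` would read freed memory, which is the kind of use-after-free the commit message describes.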
    • Li Zhong's avatar
      powerpc: Fix Oops in rtas_stop_self() · f116dbc9
      Li Zhong authored
      commit 4fb8d027 upstream.
      
      commit 41dd03a9 may cause Oops in rtas_stop_self().
      
      The reason is that the rtas_args was moved into stack space. For a box
      with more than 4GB RAM, the stack could easily be outside 32bit range,
      but RTAS is 32bit.
      
      So the patch moves rtas_args away from stack by adding static before
      it.
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: stable@vger.kernel.org # 3.14+
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      f116dbc9
    • J. Bruce Fields's avatar
      GFS2: revert "GFS2: d_splice_alias() can't return error" · 22d3112a
      J. Bruce Fields authored
      commit d57b9c9a upstream.
      
      0d0d1107 asserts that "d_splice_alias()
      can't return error unless it was given an IS_ERR(inode)".
      
      That was true of the implementation of d_splice_alias, but this is
      really a problem with d_splice_alias: at a minimum it should be able to
      return -ELOOP in the case where inserting the given dentry would cause a
      directory loop.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      22d3112a
    • Jiri Slaby's avatar
      Revert "bio-integrity: Fix bio_integrity_verify segment start bug" · 63a97385
      Jiri Slaby authored
      This reverts commit 7cbcb219,
      misapplied upstream commit 5837c80e.
      
      The upstream commit was applied twice to stable-3.12, the second time
      to bio_integrity_generate. Revert this second application.
      
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      63a97385
  2. 23 Jun, 2014 8 commits
    • Vincent Guittot's avatar
      Revert "sched: Fix sleep time double accounting in enqueue entity" · 61844d8e
      Vincent Guittot authored
      commit 9390675a upstream.
      
      This reverts commit 282cf499.
      
      With the current implementation, the load average statistics of a sched entity
      change according to other activity on the CPU even if this activity happens
      between the running windows of the sched entity and has no influence on the
      running duration of the task.
      
      When a task wakes up on the same CPU, we currently update last_runnable_update
      with the return value of __synchronize_entity_decay without updating
      runnable_avg_sum and runnable_avg_period accordingly. In fact, we have to sync
      the load_contrib of the se with the rq's blocked_load_contrib before removing
      it from the latter (with __synchronize_entity_decay) but we must keep
      last_runnable_update unchanged for updating runnable_avg_sum/period during the
      next update_entity_load_avg.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: pjt@google.com
      Cc: alex.shi@linaro.org
      Link: http://lkml.kernel.org/r/1390376734-6800-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      61844d8e
    • Jiri Slaby's avatar
      Linux 3.12.23 · 85ee5c00
      Jiri Slaby authored
      85ee5c00
    • Boris BREZILLON's avatar
      ARM: at91: fix at91_sysirq_mask_rtc for sam9x5 SoCs · 8d8a2761
      Boris BREZILLON authored
      commit 9dcc87fe upstream.
      
      sam9x5 SoCs have the following errata:
       "RTC: Interrupt Mask Register cannot be used
        Interrupt Mask Register read always returns 0."
      
      Hence we should not rely on what IMR claims about already masked IRQs
      and just disable all IRQs.
      Signed-off-by: Boris BREZILLON <boris.brezillon@free-electrons.com>
      Reported-by: Bryan Evenson <bevenson@melinkcorp.com>
      Reviewed-by: Johan Hovold <johan@hovold.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Bryan Evenson <bevenson@melinkcorp.com>
      Cc: Andrew Victor <linux@maxim.org.za>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Mark Roszko <mark.roszko@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      8d8a2761
    • Cong Wang's avatar
      vxlan: use dev->needed_headroom instead of dev->hard_header_len · 0067390a
      Cong Wang authored
      [ Upstream commit 2853af6a ]
      
      When we mirror packets from a vxlan tunnel to another device,
      the mirror device should see the same packets (that is, without
      the outer header). Because the vxlan tunnel sets dev->hard_header_len,
      tcf_mirred() resets the mac header back to the outer mac, so the mirror device
      actually sees packets with outer headers.
      
      The vxlan tunnel should set dev->needed_headroom instead of
      dev->hard_header_len, as other IP tunnels do. This fixes
      the above problem.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: stephen hemminger <stephen@networkplumber.org>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      0067390a
    • Michal Schmidt's avatar
      rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 · 76acb942
      Michal Schmidt authored
      [ Upstream commit e5eca6d4 ]
      
      When running RHEL6 userspace on a current upstream kernel, "ip link"
      fails to show VF information.
      
      The reason is a kernel<->userspace API change introduced by commit
      88c5b5ce ("rtnetlink: Call nlmsg_parse() with correct header length"),
      after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
      in the netlink request.
      
      iproute2 adjusted for the API change in its commit 63338dca4513
      ("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").
      
      The problem has been noticed before:
      http://marc.info/?l=linux-netdev&m=136692296022182&w=2
      (Subject: Re: getting VF link info seems to be broken in 3.9-rc8)
      
      We can do better than tell those with old userspace to upgrade. We can
      recognize the old iproute2 in the kernel by checking the netlink message
      length. Even when including the IFLA_EXT_MASK attribute, its netlink
      message is shorter than struct ifinfomsg.
      
      With this patch "ip link" shows VF information in both old and new
      iproute2 versions.
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      76acb942
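The length check described in the rtnetlink message above can be sketched as follows. This is an illustration under stated assumptions, not the patch: the two sizes are hard-coded stand-ins for sizeof(struct rtgenmsg) and sizeof(struct ifinfomsg), and `pick_header_len()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sizes assumed for illustration. They mirror struct rtgenmsg (1 byte)
 * and struct ifinfomsg (16 bytes) on common ABIs but are hard-coded here.
 */
#define RTGENMSG_SIZE  ((size_t)1)
#define IFINFOMSG_SIZE ((size_t)16)

/*
 * Hypothetical helper: choose the fixed header length to skip before
 * attribute parsing, based on the payload length of the request.
 * A payload shorter than the newer header must come from an old sender.
 */
static size_t pick_header_len(size_t payload_len)
{
    assert(payload_len >= RTGENMSG_SIZE); /* shorter requests are malformed */
    return payload_len < IFINFOMSG_SIZE ? RTGENMSG_SIZE : IFINFOMSG_SIZE;
}
```

The check works because, as the message notes, the old request stays shorter than struct ifinfomsg even with the IFLA_EXT_MASK attribute included, so length alone is enough to tell the two layouts apart.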
    • Xufeng Zhang's avatar
      sctp: Fix sk_ack_backlog wrap-around problem · ddb638e6
      Xufeng Zhang authored
      [ Upstream commit d3217b15 ]
      
      Consider the scenario:
      For a TCP-style socket, while processing the COOKIE_ECHO chunk in
      sctp_sf_do_5_1D_ce(), after it has passed a series of sanity checks,
      a new association is created in sctp_unpack_cookie(). If some later
      processing fails, sctp_association_free() is called to free the
      previously allocated association, and in sctp_association_free() the
      sk_ack_backlog value is decremented for this socket. Since the initial
      value of sk_ack_backlog is 0, it becomes 65535 after the decrement:
      a wrap-around. If we then want to establish new associations
      on the same socket, ABORT is triggered since sctp deems the
      accept queue full.
      Fix this issue by only decrementing sk_ack_backlog for associations in
      the endpoint's list.
      Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      ddb638e6
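The wrap-around in the sctp message above is ordinary unsigned arithmetic on a 16-bit counter, which is consistent with the value 65535 it quotes. The sketch below reproduces it in user-space C. The helper names and the `was_counted` flag are hypothetical; the flag merely stands in for "the association is in the endpoint's list".

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional decrement: a counter at 0 wraps to 65535. */
static uint16_t backlog_dec_unguarded(uint16_t backlog)
{
    return (uint16_t)(backlog - 1);
}

/*
 * Guarded decrement: only associations that were counted in the backlog
 * (hypothetical flag) are allowed to decrement it.
 */
static uint16_t backlog_dec_guarded(uint16_t backlog, int was_counted)
{
    /* A counted association implies a non-zero backlog. */
    assert(!was_counted || backlog > 0);
    return was_counted ? (uint16_t)(backlog - 1) : backlog;
}
```

With the guard, freeing an association that never contributed to the backlog leaves the counter at 0, so the accept queue is not mistaken for a full one.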
    • Eric Dumazet's avatar
      ipv4: fix a race in ip4_datagram_release_cb() · c671113b
      Eric Dumazet authored
      [ Upstream commit 9709674e ]
      
      Alexey gave an AddressSanitizer[1] report that finally gave a good hint
      at the origin of various problems already reported by Dormando
      in the past [2].
      
      The problem comes from the fact that UDP can have a lockless TX path, and
      concurrent threads can manipulate sk_dst_cache while another thread
      is holding the socket lock and calls __sk_dst_set() in
      ip4_datagram_release_cb() (this was added in linux-3.8).
      
      It seems that all we need to do is to use sk_dst_check() and
      sk_dst_set() so that all the writers hold same spinlock
      (sk->sk_dst_lock) to prevent corruptions.
      
      The TCP stack does not need this protection, as all sk_dst_cache writers hold
      the socket lock.
      
      [1]
      https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      
      AddressSanitizer: heap-use-after-free in ipv4_dst_check
      Read of size 2 by thread T15453:
       [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
       [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
       [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
       [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
       [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Freed by thread T15455:
       [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
       [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
       [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      Allocated by thread T15453:
       [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
       [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
       [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
      __mkroute_output ./net/ipv4/route.c:1939
       [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
       [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
       [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
       [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
       [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
       [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
       [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
      ./arch/x86/kernel/entry_64.S:629
      
      [2]
      <4>[196727.311203] general protection fault: 0000 [#1] SMP
      <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
      <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
      <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
      <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
      <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
      <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
      <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
      <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
      <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
      <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
      <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
      <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>[196727.311713] Stack:
      <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
      <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
      <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
      <4>[196727.311885] Call Trace:
      <4>[196727.311907]  <IRQ>
      <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
      <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
      <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
      <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
      <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
      <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
      <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
      <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
      <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
      <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
      <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
      <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
      <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
      <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
      <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
      <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
      <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
      <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
      <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
      <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
      <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
      <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
      <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
      <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
      <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
      <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
      <4>[196727.312722]  <EOI>
      <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
      <4>[196727.313100]  RSP <ffff885effd23a70>
      <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
      <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
      Reported-by: Alexey Preobrazhensky <preobr@google.com>
      Reported-by: dormando <dormando@rydia.ne>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 8141ed9f ("ipv4: Add a socket release callback for datagram sockets")
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      c671113b
    • Dmitry Popov's avatar
      ipip, sit: fix ipv4_{update_pmtu,redirect} calls · 4b5b1dd6
      Dmitry Popov authored
      [ Upstream commit 2346829e ]
      
      ipv4_{update_pmtu,redirect} were called with the tunnel's ifindex (t->dev is a
      tunnel netdevice). This caused a wrong route lookup and failure of the pmtu update or
      redirect. We should use the same ifindex that we use in ip_route_output_* in
      the *tunnel_xmit code, which is t->parms.link.
      Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4b5b1dd6