1. 22 Mar, 2014 11 commits
  2. 21 Mar, 2014 2 commits
    • drm/ttm: don't oops if no invalidate_caches() · e00feda0
      Rob Clark authored
      commit 9ef7506f upstream.
      
      A few of the simpler TTM drivers (cirrus, ast, mgag200) do not implement
      this function, yet can somehow end up with an evicted bo:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<          (null)>]           (null)
        PGD 16e761067 PUD 16e6cf067 PMD 0
        Oops: 0010 [#1] SMP
        Modules linked in: bnep bluetooth rfkill fuse ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg btrfs zlib_deflate raid6_pq xor dm_queue_length iTCO_wdt iTCO_vendor_support coretemp kvm dcdbas dm_service_time microcode serio_raw pcspkr lpc_ich mfd_core i7core_edac edac_core ses enclosure ipmi_si ipmi_msghandler shpchp acpi_power_meter mperf nfsd auth_rpcgss nfs_acl lockd uinput sunrpc dm_multipath xfs libcrc32c ata_generic pata_acpi sr_mod cdrom
         sd_mod usb_storage mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit lpfc drm_kms_helper ttm crc32c_intel ata_piix bfa drm ixgbe libata i2c_core mdio crc_t10dif ptp crct10dif_common pps_core scsi_transport_fc dca scsi_tgt megaraid_sas bnx2 dm_mirror dm_region_hash dm_log dm_mod
        CPU: 16 PID: 2572 Comm: X Not tainted 3.10.0-86.el7.x86_64 #1
        Hardware name: Dell Inc. PowerEdge R810/0H235N, BIOS 0.3.0 11/14/2009
        task: ffff8801799dabc0 ti: ffff88016c884000 task.ti: ffff88016c884000
        RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
        RSP: 0018:ffff88016c885ad8  EFLAGS: 00010202
        RAX: ffffffffa04e94c0 RBX: ffff880178937a20 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000240004 RDI: ffff880178937a00
        RBP: ffff88016c885b60 R08: 00000000000171a0 R09: ffff88007cf171a0
        R10: ffffea0005842540 R11: ffffffff810487b9 R12: ffff880178937b30
        R13: ffff880178937a00 R14: ffff88016c885b78 R15: ffff880179929400
        FS:  00007f81ba2ef980(0000) GS:ffff88007cf00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000016e763000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Stack:
         ffffffffa0306fae ffff8801799295c0 0000000000260004 0000000000000001
         ffff88016c885b60 ffffffffa0307669 00ff88007cf17738 ffff88017cf17700
         ffff880178937a00 ffff880100000000 ffff880100000000 0000000079929400
        Call Trace:
         [<ffffffffa0306fae>] ? ttm_bo_handle_move_mem+0x54e/0x5b0 [ttm]
         [<ffffffffa0307669>] ? ttm_bo_mem_space+0x169/0x340 [ttm]
         [<ffffffffa0307bd7>] ttm_bo_move_buffer+0x117/0x130 [ttm]
         [<ffffffff81130001>] ? perf_event_init_context+0x141/0x220
         [<ffffffffa0307cb1>] ttm_bo_validate+0xc1/0x130 [ttm]
         [<ffffffffa04e7377>] mgag200_bo_pin+0x87/0xc0 [mgag200]
         [<ffffffffa04e56c4>] mga_crtc_cursor_set+0x474/0xbb0 [mgag200]
         [<ffffffff811971d2>] ? __mem_cgroup_commit_charge+0x152/0x3b0
         [<ffffffff815c4182>] ? mutex_lock+0x12/0x2f
         [<ffffffffa0201433>] drm_mode_cursor_common+0x123/0x170 [drm]
         [<ffffffffa0205231>] drm_mode_cursor_ioctl+0x41/0x50 [drm]
         [<ffffffffa01f5ca2>] drm_ioctl+0x502/0x630 [drm]
         [<ffffffff815cbab4>] ? __do_page_fault+0x1f4/0x510
         [<ffffffff8101cb68>] ? __restore_xstate_sig+0x218/0x4f0
         [<ffffffff811b4445>] do_vfs_ioctl+0x2e5/0x4d0
         [<ffffffff8124488e>] ? file_has_perm+0x8e/0xa0
         [<ffffffff811b46b1>] SyS_ioctl+0x81/0xa0
         [<ffffffff815d05d9>] system_call_fastpath+0x16/0x1b
        Code:  Bad RIP value.
        RIP  [<          (null)>]           (null)
         RSP <ffff88016c885ad8>
        CR2: 0000000000000000
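      
      The trace above is a call through a NULL function pointer. A minimal
      sketch of the kind of guard that avoids it, assuming a simplified
      driver structure (struct bo_driver, call_invalidate_caches() and
      example_invalidate() are hypothetical names, not the TTM API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a TTM driver's hook table. */
struct bo_driver {
    /* Optional hook: simple drivers may leave this NULL. */
    int (*invalidate_caches)(void *bdev, unsigned int flags);
};

/* Call the hook only if the driver provides one; otherwise succeed. */
int call_invalidate_caches(const struct bo_driver *drv, void *bdev,
                           unsigned int flags)
{
    assert(drv != NULL);
    if (drv->invalidate_caches == NULL)
        return 0;
    return drv->invalidate_caches(bdev, flags);
}

/* Example hook showing that a provided callback is still invoked. */
int example_invalidate(void *bdev, unsigned int flags)
{
    (void)bdev;
    return (int)flags;
}
```

      A driver that leaves the hook unset then takes the early-return path
      instead of jumping to address 0.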
      
      Signed-off-by: Rob Clark <rclark@redhat.com>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • memcg: reparent charges of children before processing parent · fda958d5
      Filipe Brandenburger authored
      commit 4fb1a86f upstream.
      
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline parent can get run before workitem to offline child;
      parent's mem_cgroup_reparent_charges() circles around waiting for the
      child's pages to be reparented to its lrus, but it's holding
      cgroup_mutex which prevents the child from reaching its
      mem_cgroup_reparent_charges().
      
      Further testing showed that an ordered workqueue for cgroup_destroy_wq
      is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
      stage on the way can mess up the order before reaching the workqueue.
      
      Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
      all its children (and grandchildren, in the correct order) to have their
      charges reparented first.
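      
      The ordering described above amounts to a post-order walk: drain every
      descendant before the group itself. A toy sketch under that reading
      (struct toy_memcg and the toy_* helpers are hypothetical and much
      simpler than the kernel's memcg code):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical toy model of a memory cgroup; not the kernel's struct. */
struct toy_memcg {
    struct toy_memcg *parent;
    struct toy_memcg *children[MAX_CHILDREN];
    size_t nr_children;
    unsigned long usage;   /* charges still accounted to this group */
};

/* Link child under parent (toy helper). */
void toy_add_child(struct toy_memcg *parent, struct toy_memcg *child)
{
    assert(parent->nr_children < MAX_CHILDREN);
    child->parent = parent;
    parent->children[parent->nr_children++] = child;
}

/* Move all charges of one group to its parent. */
static void toy_reparent_charges(struct toy_memcg *cg)
{
    if (cg->parent != NULL)
        cg->parent->usage += cg->usage;
    cg->usage = 0;
}

/*
 * Offline a group: descendants are drained first (post-order), so by the
 * time the group itself is processed nothing below it still holds charges.
 */
void toy_offline(struct toy_memcg *cg)
{
    for (size_t i = 0; i < cg->nr_children; i++)
        toy_offline(cg->children[i]);
    toy_reparent_charges(cg);
}
```

      Because the parent never waits on work queued for a child, the order
      in which offline work items happen to run no longer matters in this
      model.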
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Signed-off-by: Filipe Brandenburger <filbranden@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org> [v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
  3. 14 Mar, 2014 4 commits
  4. 13 Mar, 2014 11 commits
    • net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable · 00c53b02
      Daniel Borkmann authored
      [ Upstream commit ec0223ec ]
      
      RFC4895 introduced AUTH chunks for SCTP; during the SCTP
      handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS
      being optional though):
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
      
      A special case is when an endpoint requires COOKIE-ECHO
      chunks to be authenticated:
      
        ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
        <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
        ------------------ AUTH; COOKIE-ECHO ---------------->
        <-------------------- COOKIE-ACK ---------------------
      
      RFC4895, section 6.3. Receiving Authenticated Chunks says:
      
        The receiver MUST use the HMAC algorithm indicated in
        the HMAC Identifier field. If this algorithm was not
        specified by the receiver in the HMAC-ALGO parameter in
        the INIT or INIT-ACK chunk during association setup, the
        AUTH chunk and all the chunks after it MUST be discarded
        and an ERROR chunk SHOULD be sent with the error cause
        defined in Section 4.1. [...] If no endpoint pair shared
        key has been configured for that Shared Key Identifier,
        all authenticated chunks MUST be silently discarded. [...]
      
        When an endpoint requires COOKIE-ECHO chunks to be
        authenticated, some special procedures have to be followed
        because the reception of a COOKIE-ECHO chunk might result
        in the creation of an SCTP association. If a packet arrives
        containing an AUTH chunk as a first chunk, a COOKIE-ECHO
        chunk as the second chunk, and possibly more chunks after
        them, and the receiver does not have an STCB for that
        packet, then authentication is based on the contents of
        the COOKIE-ECHO chunk. In this situation, the receiver MUST
        authenticate the chunks in the packet by using the RANDOM
        parameters, CHUNKS parameters and HMAC_ALGO parameters
        obtained from the COOKIE-ECHO chunk, and possibly a local
        shared secret as inputs to the authentication procedure
        specified in Section 6.3. If authentication fails, then
        the packet is discarded. If the authentication is successful,
        the COOKIE-ECHO and all the chunks after the COOKIE-ECHO
        MUST be processed. If the receiver has an STCB, it MUST
        process the AUTH chunk as described above using the STCB
        from the existing association to authenticate the
        COOKIE-ECHO chunk and all the chunks after it. [...]
      
      Commit bbd0d598 introduced the reception and
      verification of AUTH chunks, including the edge case for
      authenticated COOKIE-ECHO. On reception of COOKIE-ECHO,
      the function sctp_sf_do_5_1D_ce() handles processing,
      unpacks and creates a new association if it passed sanity
      checks and also tests for authentication chunks being
      present. After a new association has been processed, it
      invokes sctp_process_init() on the new association and
      walks through the parameter list it received from the INIT
      chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO
      and SCTP_PARAM_CHUNKS, and copies them into asoc->peer
      meta data (peer_random, peer_hmacs, peer_chunks) in case
      sysctl -w net.sctp.auth_enable=1 is set. If SCTP_CID_AUTH is
      set in INIT's SCTP_PARAM_SUPPORTED_EXT parameter,
      peer_random != NULL and peer_hmacs != NULL, the peer is
      assumed to be asoc->peer.auth_capable=1; in any other case
      asoc->peer.auth_capable=0.
      
      Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is
      available, we set up a fake auth chunk and pass that on to
      sctp_sf_authenticate(), which at latest in
      sctp_auth_calculate_hmac() reliably dereferences a NULL pointer
      at position 0..0008 when setting up the crypto key in
      crypto_hash_setkey() by using asoc->asoc_shared_key that is
      NULL as condition key_id == asoc->active_key_id is true if
      the AUTH chunk was injected correctly from remote. This
      happens no matter what net.sctp.auth_enable sysctl says.
      
      The fix is to check for net->sctp.auth_enable and for
      asoc->peer.auth_capable before doing any operations like
      sctp_sf_authenticate() as no key is activated in
      sctp_auth_asoc_init_active_key() for each case.
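      
      The check described above can be sketched as a small decision
      function. This is an illustration only: struct toy_assoc, enum
      toy_verdict and toy_cookie_echo_auth() are hypothetical names that
      stand in for the kernel state, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the state the check depends on. */
struct toy_assoc {
    bool peer_auth_capable;   /* stands in for asoc->peer.auth_capable */
};

enum toy_verdict { TOY_AUTHENTICATE, TOY_DISCARD };

/*
 * Decide what to do with a COOKIE-ECHO that arrived behind an AUTH chunk.
 * Authentication is attempted only when AUTH is enabled locally and the
 * peer negotiated it; otherwise no key was activated, so the packet is
 * dropped before any key is touched.
 */
enum toy_verdict toy_cookie_echo_auth(bool auth_enable,
                                      const struct toy_assoc *asoc)
{
    assert(asoc != NULL);
    if (!auth_enable || !asoc->peer_auth_capable)
        return TOY_DISCARD;
    return TOY_AUTHENTICATE;
}
```

      Placing both tests ahead of the authentication step means the NULL
      shared key is never reached on the injected-AUTH path.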
      
      Now as RFC4895 section 6.3 states that if the used HMAC-ALGO
      passed from the INIT chunk was not used in the AUTH chunk, we
      SHOULD send an error; however in this case it would be better
      to just silently discard such a maliciously prepared handshake
      as we didn't even receive a parameter at all. Also, as our
      endpoint has no shared key configured, section 6.3 says that
      we MUST silently discard, which we are doing from now onwards.
      
      Before calling sctp_sf_pdiscard(), we need not only to free
      the association, but also the chunk->auth_chunk skb, as
      commit bbd0d598 created a skb clone in that case.
      
      I have tested this locally by using netfilter's nfqueue and
      re-injecting packets into the local stack after maliciously
      modifying the INIT chunk (removing RANDOM; HMAC-ALGO param)
      and the SCTP packet containing the COOKIE_ECHO (injecting
      AUTH chunk before COOKIE_ECHO). Fixed with this patch applied.
      
      Fixes: bbd0d598 ("[SCTP]: Implement the receive and verification of AUTH chunk")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Vlad Yasevich <yasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer · 2a01e9e4
      Xin Long authored
      [ Upstream commit 10ddceb2 ]
      
      When ip_tunnel processes multicast packets, it may check whether the
      packet is a looped-back packet through
      'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(), but before
      that, skb->_skb_refdst has already been dropped in
      iptunnel_pull_header(), which leads to a panic.
      
      Fixes the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • tg3: Don't check undefined error bits in RXBD · baef79a5
      Michael Chan authored
      [ Upstream commit d7b95315 ]
      
      Redefine the RXD_ERR_MASK to include only relevant error bits. This fixes
      a customer reported issue of randomly dropping packets on the 5719.
      
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ipv6: ipv6_find_hdr restore prev functionality · 537fc749
      Hans Schillstrom authored
      [ Upstream commit accfe0e3 ]
      
      The commit 9195bb8e ("ipv6: improve
      ipv6_find_hdr() to skip empty routing headers") broke ipv6_find_hdr().
      
      When a specific target such as IPPROTO_ICMPV6 is given, ipv6_find_hdr()
      returns -ENOENT even when the header is found, instead of the header
      as expected.
      
      Part of IPVS is broken, and possibly also nft_exthdr_eval().
      When target is -1, which it is in most cases, it works.
      
      This patch exits the do while loop if the specific header is found
      so the nexthdr could be returned as expected.
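      
      A toy model of the intended loop behaviour: walk the header chain and
      leave the do-while loop as soon as the requested header is seen. The
      chain is reduced to an array of protocol numbers, and toy_find_hdr()
      and toy_is_ext_hdr() are hypothetical helpers, not the kernel
      function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* IANA protocol numbers used in this toy chain. */
#define TOY_NEXTHDR_HOP      0    /* hop-by-hop options */
#define TOY_NEXTHDR_TCP      6
#define TOY_NEXTHDR_ROUTING  43
#define TOY_NEXTHDR_ICMPV6   58

static bool toy_is_ext_hdr(int nexthdr)
{
    return nexthdr == TOY_NEXTHDR_HOP || nexthdr == TOY_NEXTHDR_ROUTING;
}

/*
 * Walk a header chain given as an array of protocol numbers.
 * target < 0:  return the first non-extension header.
 * target >= 0: return target if it is in the chain, else -ENOENT.
 */
int toy_find_hdr(const int *chain, size_t len, int target)
{
    size_t i = 0;
    bool found;

    assert(chain != NULL && len > 0);
    do {
        int nexthdr = chain[i];

        found = (nexthdr == target);
        if (found)
            break;  /* leave the loop as soon as the target is seen */
        if (!toy_is_ext_hdr(nexthdr)) {
            if (target < 0)
                return nexthdr;
            return -ENOENT;
        }
        i++;
    } while (i < len);

    return found ? target : -ENOENT;
}
```

      With a chain of hop-by-hop, routing and ICMPv6 headers, both a
      target of -1 and an explicit ICMPv6 target yield the ICMPv6 protocol
      number in this model.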
      
      Reported-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
      Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
      Cc: Ansis Atteka <aatteka@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • sfc: check for NULL efx->ptp_data in efx_ptp_event · 1c026764
      Edward Cree authored
      [ Upstream commit 8f355e5c ]
      
      If we receive a PTP event from the NIC when we haven't set up PTP state
      in the driver, we attempt to read through a NULL pointer efx->ptp_data,
      triggering a panic.
      
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ipv6: reuse ip6_frag_id from ip6_ufo_append_data · 3bbb02a1
      Hannes Frederic Sowa authored
      [ Upstream commit 916e4cf4 ]
      
      Currently we generate a new fragmentation id on UFO segmentation. It
      is pretty hairy to identify the correct net namespace and dst there.
      Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
      available at all.
      
      This causes unreliable or very predictable ipv6 fragmentation id
      generation during segmentation.
      
      Luckily we already have pregenerated the ip6_frag_id in
      ip6_ufo_append_data and can use it here.
      
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • virtio-net: alloc big buffers also when guest can receive UFO · 251ed2ca
      Jason Wang authored
      [ Upstream commit 0e7ede80 ]
      
      We should alloc big buffers also when guest can receive UFO
      packets to let the big packets fit into guest rx buffer.
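      
      A sketch of the resulting condition, assuming big receive buffers
      are selected whenever any guest-side large-packet offload has been
      negotiated. The feature bit numbers follow the virtio-net
      specification; toy_has_feature() and toy_needs_big_packets() are
      hypothetical helpers, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as defined by the virtio-net specification. */
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_GUEST_TSO6  8
#define VIRTIO_NET_F_GUEST_ECN   9
#define VIRTIO_NET_F_GUEST_UFO   10

/* Hypothetical helper: test one bit in a negotiated feature mask. */
static bool toy_has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1u) != 0;
}

/* Any guest-side offload that can deliver large packets needs big buffers. */
bool toy_needs_big_packets(uint64_t features)
{
    return toy_has_feature(features, VIRTIO_NET_F_GUEST_TSO4) ||
           toy_has_feature(features, VIRTIO_NET_F_GUEST_TSO6) ||
           toy_has_feature(features, VIRTIO_NET_F_GUEST_ECN)  ||
           toy_has_feature(features, VIRTIO_NET_F_GUEST_UFO);
}
```

      The UFO term is the case this commit is about: without it, a guest
      that negotiated only UFO would keep small buffers in this model.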
      
      Fixes 5c516751
      (virtio-net: Allow UFO feature to be set and advertised.)
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • neigh: recompute reachabletime before returning from neigh_periodic_work() · df3839f9
      Duan Jiong authored
      [ Upstream commit feff9ab2 ]
      
      If the number of entries in the neigh table is less than gc_thresh1,
      the function returns directly and the reachabletime is not recomputed,
      so the reachabletime can be guessed.
      
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • net-tcp: fastopen: fix high order allocations · 10bbbd58
      Eric Dumazet authored
      [ Upstream commit f5ddcbbb ]
      
      This patch fixes two bugs in fastopen :
      
      1) The tcp_sendmsg(...,  @size) argument was ignored.
      
         Code was relying on user not fooling the kernel with iovec mismatches
      
      2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5
      allocations, which are likely to fail when memory gets fragmented.
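      
      The "order-5" figure follows from page arithmetic. Assuming 4 KiB
      pages, exactly 64 KiB is 16 pages (order 4), and any request even
      slightly larger rounds up to 32 pages (order 5). toy_get_order() is
      a hypothetical stand-in for the kernel's order calculation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Smallest order such that (TOY_PAGE_SIZE << order) >= size. */
unsigned int toy_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = TOY_PAGE_SIZE;

    assert(size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

      In this model a 64 KiB payload plus any per-buffer overhead at all
      lands in order 5, which is a 128 KiB physically contiguous request.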
      
      Fixes: 783237e8 ("net-tcp: Fast Open client - sending SYN-data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Tested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • tun: remove bogus hardware vlan acceleration flags from vlan_features · cbf5a3f6
      Fernando Luis Vazquez Cao authored
      [ Upstream commit 6671b224 ]
      
      Even though only the outer vlan tag can be HW accelerated in the transmission
      path, in the TUN/TAP driver vlan_features mirrors hw_features, which happens
      to have the NETIF_F_HW_VLAN_?TAG_TX flags set. Because of this, during packet
      transmission through a stacked vlan device, dev_hard_start_xmit, (incorrectly)
      assuming that the vlan device supports hardware vlan acceleration, does not
      add the vlan header to the skb payload and the inner vlan tags are lost
      (vlan_tci contains the outer vlan tag when userspace reads the packet from
      the tap device).
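      
      One way to picture the fix is as a mask applied when deriving
      vlan_features from the device features. The bit positions below are
      hypothetical (the kernel's NETIF_F_* values differ), and
      toy_vlan_features() is an illustrative helper, not the driver's
      code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions; the kernel's real values differ. */
#define TOY_F_SG               (UINT64_C(1) << 0)
#define TOY_F_HW_VLAN_CTAG_TX  (UINT64_C(1) << 1)
#define TOY_F_HW_VLAN_STAG_TX  (UINT64_C(1) << 2)

/*
 * Features that vlan devices stacked on this device may inherit:
 * everything the device offers except hardware vlan tag insertion.
 */
uint64_t toy_vlan_features(uint64_t dev_features)
{
    return dev_features & ~(TOY_F_HW_VLAN_CTAG_TX | TOY_F_HW_VLAN_STAG_TX);
}
```

      With the tag-insertion bits cleared, a vlan device stacked on top no
      longer claims hardware acceleration, so the inner tag is written into
      the packet payload.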
      
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • veth: Fix vlan_features so as to be able to use stacked vlan interfaces · cc9f71b2
      Toshiaki Makita authored
      [ Upstream commit 8d0d21f4 ]
      
      Even if we create a stacked vlan interface such as veth0.10.20, it sends
      single tagged frames (tagged with only vid 10).
      Because vlan_features of a veth interface has the
      NETIF_F_HW_VLAN_[CTAG/STAG]_TX bits, veth0.10 also has that feature, so
      dev_hard_start_xmit(veth0.10) doesn't call __vlan_put_tag() and
      vlan_dev_hard_start_xmit(veth0.10) overwrites vlan_tci.
      This prevents us from using a combination of 802.1ad and 802.1Q
      in containers, etc.
      
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
  5. 12 Mar, 2014 12 commits
    • SELinux: Increase ebitmap_node size for 64-bit configuration · 17d633ca
      Waiman Long authored
      commit a767f680 upstream.
      
      Currently, the ebitmap_node structure has a fixed size of 32 bytes. On
      a 32-bit system, the overhead is 8 bytes, leaving 24 bytes for being
      used as bitmaps. The overhead ratio is 1/4.
      
      On a 64-bit system, the overhead is 16 bytes. Therefore, only 16 bytes
      are left for bitmap purpose and the overhead ratio is 1/2. With a
      3.8.2 kernel, a boot-up operation will cause the ebitmap_get_bit()
      function to be called about 9 million times. The average number of
      ebitmap_node traversals is about 3.7.
      
      This patch increases the size of the ebitmap_node structure to 64
      bytes for 64-bit system to keep the overhead ratio at 1/4. This may
      also improve performance a little bit by making node to node traversal
      less frequent (< 2) as more bits are available in each node.
      
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • SELinux: Reduce overhead of mls_level_isvalid() function call · 3b32517b
      Waiman Long authored
      commit fee71142 upstream.
      
      While running the high_systime workload of the AIM7 benchmark on
      a 2-socket 12-core Westmere x86-64 machine running 3.10-rc4 kernel
      (with HT on), it was found that a pretty sizable amount of time was
      spent in the SELinux code. Below was the perf trace of the "perf
      record -a -s" of a test run at 1500 users:
      
        5.04%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        1.96%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        1.95%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The ebitmap_get_bit() was the hottest function in the perf-report
      output.  Both the ebitmap_get_bit() and find_next_bit() functions
      were, in fact, called by mls_level_isvalid(). As a result, the
      mls_level_isvalid() call consumed 8.95% of the total CPU time of
      all the 24 virtual CPUs which is quite a lot. The majority of the
      mls_level_isvalid() function invocations come from the socket creation
      system call.
      
      Looking at the mls_level_isvalid() function, it is checking to see
      if all the bits set in one of the ebitmap structure are also set in
      another one as well as the highest set bit is no bigger than the one
      specified by the given policydb data structure. It is doing it in
      a bit-by-bit manner. So if the ebitmap structure has many bits set,
      the iteration loop will be done many times.
      
      The current code can be rewritten to use a similar algorithm as the
      ebitmap_contains() function with an additional check for the
      highest set bit. The ebitmap_contains() function was extended to
      cover an optional additional check for the highest set bit, and the
      mls_level_isvalid() function was modified to call ebitmap_contains().
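      
      The difference between the bit-by-bit loop and a word-wise
      containment check can be sketched on flat bitmaps. This is a
      simplified illustration: the kernel's ebitmap is a linked list of
      nodes, and toy_bitmap_contains() is a hypothetical function, not
      ebitmap_contains().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_WORD_BITS 64u

/*
 * Return true when every bit set in `sub` is also set in `super` and no
 * bit in `sub` lies above `last_bit`. Both maps hold `nwords` 64-bit words;
 * bit i lives in word i / 64 at position i % 64.
 */
bool toy_bitmap_contains(const uint64_t *super, const uint64_t *sub,
                         size_t nwords, size_t last_bit)
{
    assert(super != NULL && sub != NULL);
    for (size_t w = 0; w < nwords; w++) {
        uint64_t word = sub[w];
        size_t high = TOY_WORD_BITS - 1;

        if (word == 0)
            continue;
        /* One AND-NOT per word replaces up to 64 single-bit lookups. */
        if ((word & ~super[w]) != 0)
            return false;
        /* Locate the highest set bit in this word and check the limit. */
        while (((word >> high) & 1u) == 0)
            high--;
        if (w * TOY_WORD_BITS + high > last_bit)
            return false;
    }
    return true;
}
```

      The cost now scales with the number of words rather than the number
      of set bits, which is the property the commit message relies on.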
      
      With that change, the perf trace showed that the CPU time used dropped
      to just 0.08% (ebitmap_contains + mls_level_isvalid) of the
      total, which is about 100X less than before.
      
        0.07%            ls  [kernel.kallsyms]     [k] ebitmap_contains
        0.05%            ls  [kernel.kallsyms]     [k] ebitmap_get_bit
        0.01%            ls  [kernel.kallsyms]     [k] mls_level_isvalid
        0.01%            ls  [kernel.kallsyms]     [k] find_next_bit
      
      The remaining ebitmap_get_bit() and find_next_bit() functions calls
      are made by other kernel routines as the new mls_level_isvalid()
      function will not call them anymore.
      
      This patch also improves the high_systime AIM7 benchmark result,
      though the improvement is not as impressive as is suggested by the
      reduction in CPU time spent in the ebitmap functions. The table below
      shows the performance change on the 2-socket x86-64 system (with HT
      on) mentioned above.
      
      +--------------+---------------+----------------+-----------------+
      |   Workload   | mean % change | mean % change  | mean % change   |
      |              | 10-100 users  | 200-1000 users | 1100-2000 users |
      +--------------+---------------+----------------+-----------------+
      | high_systime |     +0.1%     |     +0.9%      |     +2.6%       |
      +--------------+---------------+----------------+-----------------+
      
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: do not walk all of system memory during show_mem · 67e204a0
      Mel Gorman authored
      commit c78e9363 upstream.
      
      It has been reported on very large machines that show_mem is taking almost
      5 minutes to display information.  This is a serious problem if there is
      an OOM storm.  The bulk of the cost is in show_mem doing a very expensive
      PFN walk to give us the following information
      
        Total RAM:       Also available as totalram_pages
        Highmem pages:   Also available as totalhigh_pages
        Reserved pages:  Can be inferred from the zone structure
        Shared pages:    PFN walk required
        Unshared pages:  PFN walk required
        Quick pages:     Per-cpu walk required
      
      Only the shared/unshared pages requires a full PFN walk but that
      information is useless.  It is also inaccurate as page pins of unshared
      pages would be accounted for as shared.  Even if the information was
      accurate, I'm struggling to think how the shared/unshared information
      could be useful for debugging OOM conditions.  Maybe it was useful before
      rmap existed when reclaiming shared pages was costly but it is less
      relevant today.
      
      The PFN walk could be optimised a bit but why bother as the information is
      useless.  This patch deletes the PFN walker and infers the total RAM,
      highmem and reserved pages count from struct zone.  It omits the
      shared/unshared page usage on the grounds that it is useless.  It also
      corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
      has similar problems to HighMem with respect to lowmem/highmem exhaustion.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL · 230a558c
      Jason Baron authored
      commit 4ff36ee9 upstream.
      
      The EPOLL_CTL_DEL path of epoll contains a classic ab-ba deadlock.
      That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x) will deadlock with
      epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
      commit 67347fe4 ("epoll: do not take global 'epmutex' for simple
      topologies").
      
      The acquisition of the ep->mtx for the destination 'ep' was added such
      that a concurrent EPOLL_CTL_ADD operation would see the correct state of
      the ep (specifically, the check for '!list_empty(&f.file->f_ep_links)').
      
      However, by simply not acquiring the lock, we do not serialize behind
      the ep->mtx from the add path, and thus may perform a full path check
      when, had we waited a little longer, it may not have been necessary.
      However, this is a transient state, and performing the full loop
      checking in this case is not harmful.
      
      The important point is that we wouldn't miss doing the full loop
      checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
      it's operating upon.  The reason we don't need to do lock ordering in the
      add path is that we are already holding the global 'epmutex'
      whenever we do the double lock.  Further, the original posting of this
      patch, which was tested for the intended performance gains, did not
      perform this additional locking.
      
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • epoll: do not take global 'epmutex' for simple topologies · 107c1943
      Jason Baron authored
      commit 67347fe4 upstream.
      
      When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
      directly to a wakeup source, we do not need to take the global 'epmutex',
      unless the epoll file descriptor is nested.  The purpose of taking the
      'epmutex' on add is to prevent complex topologies such as loops and deep
      wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
      operations.  However, for the simple case of an epoll file descriptor
      attached directly to a wakeup source (with no nesting), we do not need to
      hold the 'epmutex'.
      
      This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
      scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
      performance:
      
      "On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
      addition the benchmark went from scaling well on 10 sockets to scaling
      well on just over 40 sockets.
      
      ...
      
      Currently the benchmark stops scaling at around 40-44 sockets but it seems like
      I found a second unrelated bottleneck."
      
      [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      107c1943
    • Jason Baron's avatar
      epoll: optimize EPOLL_CTL_DEL using rcu · 81ff0d3b
      Jason Baron authored
      commit ae10b2b4 upstream.
      
      Nathan Zimmer found that once we get past 10 CPUs, the scalability of
      SPECjbb falls over due to contention on the global 'epmutex', which is
      taken on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.
      
      Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
      by using rcu to guard against any concurrent traversals.
      
      Patch #2 removes the 'epmutex' lock from EPOLL_CTL_ADD operations for
      simple topologies, i.e. when adding a link from an epoll file descriptor
      to a wakeup source, where the epoll file descriptor is not nested.
      
      This patch (of 2):
      
      Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
      converting the file->f_ep_links list into an rcu one.  In this way, we can
      traverse the epoll network on the add path in parallel with deletes.
      Since deletes can't create loops or worse wakeup paths, this is safe.
      
      This patch in combination with the patch "epoll: Do not take global 'epmutex'
      for simple topologies", shows a dramatic performance improvement in
      scalability for SPECjbb.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Tested-by: Nathan Zimmer <nzimmer@sgi.com>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      CC: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      81ff0d3b
    • Jiri Slaby's avatar
      x86/dumpstack: Fix printk_address for direct addresses · dd7ab812
      Jiri Slaby authored
      commit 5f01c988 upstream.
      
      Consider a kernel crash in a module, simulated the following way:
      
       static int my_init(void)
       {
               char *map = (void *)0x5;
               *map = 3;
               return 0;
       }
       module_init(my_init);
      
      When we turn off FRAME_POINTERs, the very first instruction in
      that function causes a BUG. The problem is that we print IP in
      the BUG report using %pB (from printk_address). And %pB
      decrements the pointer by one to fix printing addresses of
      functions with tail calls.
      
      This was added in commit 71f9e598 ("x86, dumpstack: Use
      %pB format specifier for stack trace") to fix the call stack
      printouts.
      
      So instead of correct output:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa01ac000>] my_init+0x0/0x10 [pb173]
      
      We get:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
        IP: [<ffffffffa0152000>] 0xffffffffa0151fff
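
      The effect of the one-byte decrement can be modelled in userspace.
      This is a simplified sketch, not the kernel's symbol lookup: the
      symbol table, the addresses and the names lookup(), resolve_direct()
      and resolve_backtrace() are all hypothetical. resolve_direct() plays
      the role of %pS, and resolve_backtrace() the role of %pB.

```c
#include <stddef.h>
#include <stdint.h>

struct sym {
	uintptr_t start;
	size_t size;
	const char *name;
};

/* Hypothetical layout: my_init starts right where prev_fn ends. */
static const struct sym table[] = {
	{ 0x1000, 0x40, "prev_fn" },
	{ 0x1040, 0x10, "my_init" },
};

/* Return the name of the symbol containing addr, or NULL if none does. */
const char *lookup(uintptr_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (addr >= table[i].start &&
		    addr < table[i].start + table[i].size)
			return table[i].name;
	}
	return NULL;
}

/* %pS-like: resolve the address exactly as given. */
const char *resolve_direct(uintptr_t addr)
{
	return lookup(addr);
}

/*
 * %pB-like: a return address points just past the call instruction, so
 * step back one byte before resolving. This is right for return
 * addresses, but wrong for an instruction pointer that sits on the
 * first byte of a function.
 */
const char *resolve_backtrace(uintptr_t addr)
{
	return lookup(addr - 1);
}
```

      With this table, a fault at the first instruction of my_init (0x1040)
      resolves to my_init directly but to prev_fn with the decrement. At
      0x1000, where nothing precedes the function, the decremented lookup
      finds no symbol at all, which corresponds to the raw address in the
      broken output above.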
      
      To fix that, we use %pB only for stack address printouts (via the
      newly added printk_stack_address) and %pS for regs->ip (via
      printk_address). I.e. we revert to the old behaviour for everything
      except call stacks. And since 'reliable' is 1 at all of the remaining
      printk_address call sites, we remove that parameter from printk_address.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: joe@perches.com
      Cc: jirislaby@gmail.com
      Link: http://lkml.kernel.org/r/1382706418-8435-1-git-send-email-jslaby@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      dd7ab812
    • NeilBrown's avatar
      SUNRPC: close a rare race in xs_tcp_setup_socket. · 6456bd97
      NeilBrown authored
      commit 93dc41bd upstream.
      
      We have one report of a crash in xs_tcp_setup_socket.
      The call path to the crash is:
      
        xs_tcp_setup_socket -> inet_stream_connect -> lock_sock_nested.
      
      The 'sock' passed to that last function is NULL.
      
      The only way I can see this happening is a concurrent call to
      xs_close:
      
        xs_close -> xs_reset_transport -> sock_release -> inet_release
      
      inet_release sets:
         sock->sk = NULL;
      inet_stream_connect calls
         lock_sock(sock->sk);
      which gets NULL.
      
      All calls to xs_close are protected by XPRT_LOCKED as are most
      activations of the workqueue which runs xs_tcp_setup_socket.
      The exception is xs_tcp_schedule_linger_timeout.
      
      So presumably the timeout queued by the latter fires exactly when some
      other code runs xs_close().
      
      To protect against this we can move the cancel_delayed_work_sync()
      call from xs_destroy() to xs_close().
      
      As xs_close is never called from the worker scheduled on
      ->connect_worker, this can never deadlock.
      Signed-off-by: NeilBrown <neilb@suse.de>
      [Trond: Make it safe to call cancel_delayed_work_sync() on AF_LOCAL sockets]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      6456bd97
    • Shawn Bohrer's avatar
      sched/rt: Remove redundant nr_cpus_allowed test · 4aef0b11
      Shawn Bohrer authored
      commit 6bfa687c upstream.
      
      In 76854c7e ("sched: Use
      rt.nr_cpus_allowed to recover select_task_rq() cycles") an
      optimization was added to select_task_rq_rt() that immediately
      returns when p->nr_cpus_allowed == 1 at the beginning of the
      function.
      
      This makes the later p->nr_cpus_allowed > 1 check redundant,
      which can now be removed.
      Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: tomk@rgmadvisors.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4aef0b11
    • Peter Zijlstra's avatar
      sched/rt: Add missing rmb() · d171cbfc
      Peter Zijlstra authored
      commit 7c3f2ab7 upstream.
      
      While discussing the proposed SCHED_DEADLINE patches which in parts
      mimic the existing FIFO code it was noticed that the wmb in
      rt_set_overloaded() didn't have a matching barrier.
      
      The only site using rt_overloaded() to test the rto_count is
      pull_rt_task() and we should issue a matching rmb before then assuming
      there's an rto_mask bit set.
      
      Without that smp_rmb() in there we could actually miss seeing the
      rto_mask bit.
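
      The pairing described above can be sketched in userspace with C11
      atomics. This is an analogue under stated assumptions, not the
      scheduler code: the function names set_overloaded() and
      overloaded_mask() are hypothetical, a 32-bit word stands in for the
      cpumask (so cpu must be below 32), and release/acquire fences stand
      in for smp_wmb()/smp_rmb().

```c
#include <stdatomic.h>

static atomic_uint rto_mask;	/* one bit per overloaded CPU */
static atomic_int rto_count;	/* number of overloaded CPUs */

/*
 * Writer side: publish the mask bit first, then the count.
 * The release fence orders the mask update before the count update.
 */
void set_overloaded(unsigned int cpu)
{
	atomic_fetch_or_explicit(&rto_mask, 1u << cpu, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* smp_wmb() analogue */
	atomic_fetch_add_explicit(&rto_count, 1, memory_order_relaxed);
}

/*
 * Reader side: test the count first, then read the mask.
 * The acquire fence is the matching barrier: once a non-zero count
 * has been observed, the mask bit set before it is visible too.
 * Returns 0 when nothing is overloaded.
 */
unsigned int overloaded_mask(void)
{
	if (atomic_load_explicit(&rto_count, memory_order_relaxed) == 0)
		return 0;
	atomic_thread_fence(memory_order_acquire);	/* smp_rmb() analogue */
	return atomic_load_explicit(&rto_mask, memory_order_relaxed);
}
```

      Without the reader-side fence, nothing orders the mask load after the
      count load, so a reader could see the count and still miss the bit.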
      
      Also, change to using smp_[wr]mb(), even though this is SMP only code;
      memory barriers without smp_ always make me think they're against
      hardware of some sort.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: luca.abeni@unitn.it
      Cc: bruce.ashfield@windriver.com
      Cc: dhaval.giani@gmail.com
      Cc: rostedt@goodmis.org
      Cc: hgu1972@gmail.com
      Cc: oleg@redhat.com
      Cc: fweisbec@gmail.com
      Cc: darren@dvhart.com
      Cc: johan.eker@ericsson.com
      Cc: p.faure@akatech.ch
      Cc: paulmck@linux.vnet.ibm.com
      Cc: raistlin@linux.it
      Cc: claudio@evidence.eu.com
      Cc: insop.song@gmail.com
      Cc: michael@amarulasolutions.com
      Cc: liming.wang@windriver.com
      Cc: fchecconi@gmail.com
      Cc: jkacur@redhat.com
      Cc: tommaso.cucinotta@sssup.it
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: harald.gustafsson@ericsson.com
      Cc: nicola.manica@disi.unitn.it
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      d171cbfc
    • Mel Gorman's avatar
      sched: Assign correct scheduling domain to 'sd_llc' · 4edad9c1
      Mel Gorman authored
      commit 5d4cf996 upstream.
      
      Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
      dereference on sd_busy but the fix also altered what scheduling domain it
      used for the 'sd_llc' percpu variable.
      
      One impact of this is that a task selecting a runqueue may consider
      idle CPUs that are not cache siblings as candidates for running.
      Tasks are then running on CPUs that are not cache hot.
      
      This was found through bisection where ebizzy threads were not seeing equal
      performance and it looked like a scheduling fairness issue. This patch
      mitigates but does not completely fix the problem on all machines tested
      implying there may be an additional bug or a common root cause. Here is
      the average range of performance seen by individual ebizzy threads. It
      was tested on top of candidate patches related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
      	8-core machine
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
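
      The percentages in parentheses appear to be the relative reduction of
      the patched value against the vanilla baseline. The following sketch
      reproduces them; the function name relative_reduction_pct() is
      hypothetical, and the formula is inferred from the table rows rather
      than stated in the commit message.

```c
/*
 * Relative reduction of 'patched' against 'baseline', in percent.
 * Positive means the range shrank, negative means it grew.
 * A zero baseline is reported as 0, matching the "Mean 1" rows.
 */
double relative_reduction_pct(double baseline, double patched)
{
	if (baseline == 0.0)
		return 0.0;
	return (baseline - patched) / baseline * 100.0;
}
```

      For example, the 4-core "Mean 2" row gives (0.34 - 0.10) / 0.34, or
      about 70.59%, and the 8-core "Mean 6" row gives (23.21 - 69.46) / 23.21,
      or about -199.27%, i.e. a regression for that thread count.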
      
      It's eliminated for one machine and reduced for another.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: H Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      4edad9c1
    • Peter Zijlstra's avatar
      sched: Initialize power_orig for overlapping groups · 8dc051a7
      Peter Zijlstra authored
      commit 8e8339a3 upstream.
      
      Yinghai reported that he saw a division by zero in sg_capacity on his EX parts.
      Make sure to always initialize power_orig now that we actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      8dc051a7