1. 14 Aug, 2014 2 commits
    • Eric Dumazet's avatar
      ip: make IP identifiers less predictable · 509a15a5
      Eric Dumazet authored
      [ Upstream commit 04ca6973 ]
      
      In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
      Jedidiah describe ways exploiting linux IP identifier generation to
      infer whether two machines are exchanging packets.
      
      With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
      changed IP id generation, but this does not really prevent this
      side-channel technique.
      
      This patch adds a random amount of perturbation so that IP identifiers
      for a given destination [1] are no longer monotonically increasing after
      an idle period.
      
      Note that prandom_u32_max(1) returns 0, so if generator is used at most
      once per jiffy, this patch inserts no hole in the ID suite and do not
      increase collision probability.
      
      This is jiffies based, so in the worst case (HZ=1000), the id can
      rollover after ~65 seconds of idle time, which should be fine.
      
      We also change the hash used in __ip_select_ident() to not only hash
      on daddr, but also saddr and protocol, so that ICMP probes can not be
      used to infer information for other protocols.
      
      For IPv6, adds saddr into the hash as well, but not nexthdr.
      
      If I ping the patched target, we can see ID are now hard to predict.
      
      21:57:11.008086 IP (...)
          A > target: ICMP echo request, seq 1, length 64
      21:57:11.010752 IP (... id 2081 ...)
          target > A: ICMP echo reply, seq 1, length 64
      
      21:57:12.013133 IP (...)
          A > target: ICMP echo request, seq 2, length 64
      21:57:12.015737 IP (... id 3039 ...)
          target > A: ICMP echo reply, seq 2, length 64
      
      21:57:13.016580 IP (...)
          A > target: ICMP echo request, seq 3, length 64
      21:57:13.019251 IP (... id 3437 ...)
          target > A: ICMP echo reply, seq 3, length 64
      
      [1] TCP sessions uses a per flow ID generator not changed by this patch.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJeffrey Knockel <jeffk@cs.unm.edu>
      Reported-by: default avatarJedidiah R. Crandall <crandall@cs.unm.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hannes Frederic Sowa <hannes@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      509a15a5
    • Eric Dumazet's avatar
      inetpeer: get rid of ip_id_count · ad52eef5
      Eric Dumazet authored
      [ Upstream commit 73f156a6 ]
      
      Ideally, we would need to generate IP ID using a per destination IP
      generator.
      
      linux kernels used inet_peer cache for this purpose, but this had a huge
      cost on servers disabling MTU discovery.
      
      1) each inet_peer struct consumes 192 bytes
      
      2) inetpeer cache uses a binary tree of inet_peer structs,
         with a nominal size of ~66000 elements under load.
      
      3) lookups in this tree are hitting a lot of cache lines, as tree depth
         is about 20.
      
      4) If server deals with many tcp flows, we have a high probability of
         not finding the inet_peer, allocating a fresh one, inserting it in
         the tree with same initial ip_id_count, (cf secure_ip_id())
      
      5) We garbage collect inet_peer aggressively.
      
      IP ID generation do not have to be 'perfect'
      
      Goal is trying to avoid duplicates in a short period of time,
      so that reassembly units have a chance to complete reassembly of
      fragments belonging to one message before receiving other fragments
      with a recycled ID.
      
      We simply use an array of generators, and a Jenkin hash using the dst IP
      as a key.
      
      ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
      belongs (it is only used from this file)
      
      secure_ip_id() and secure_ipv6_id() no longer are needed.
      
      Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
      unnecessary decrement/increment of the number of segments.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ad52eef5
  2. 07 Aug, 2014 20 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.4.102 · 0a9d91dc
      Greg Kroah-Hartman authored
      0a9d91dc
    • Gao feng's avatar
      ipv6: reallocate addrconf router for ipv6 address when lo device up · a066704c
      Gao feng authored
      commit 33d99113 upstream.
      
      commit 25fb6ca4
      "net IPv6 : Fix broken IPv6 routing table after loopback down-up"
      allocates addrconf router for ipv6 address when lo device up.
      but commit a881ae1f
      "ipv6:don't call addrconf_dst_alloc again when enable lo" breaks
      this behavior.
      
      Since the addrconf router is moved to the garbage list when
      lo device down, we should release this router and rellocate
      a new one for ipv6 address when lo device up.
      
      This patch solves bug 67951 on bugzilla
      https://bugzilla.kernel.org/show_bug.cgi?id=67951
      
      change from v1:
      use ip6_rt_put to repleace ip6_del_rt, thanks Hannes!
      change code style, suggested by Sergei.
      
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: default avatarGao feng <gaofeng@cn.fujitsu.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [weilong: s/ip6_rt_put/dst_release]
      Signed-off-by: default avatarChen Weilong <chenweilong@huawei.com>
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a066704c
    • Vlastimil Babka's avatar
      mm: try_to_unmap_cluster() should lock_page() before mlocking · 21dbd45f
      Vlastimil Babka authored
      commit 57e68e9c upstream.
      
      A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
      fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
      the pages other than its check_page parameter (which is already locked).
      
      The BUG_ON in mlock_vma_page() is not documented and its purpose is
      somewhat unclear, but apparently it serializes against page migration,
      which could otherwise fail to transfer the PG_mlocked flag.  This would
      not be fatal, as the page would be eventually encountered again, but
      NR_MLOCK accounting would become distorted nevertheless.  This patch adds
      a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
      effect.
      
      The call site try_to_unmap_cluster() is fixed so that for page !=
      check_page, trylock_page() is attempted (to avoid possible deadlocks as we
      already have check_page locked) and mlock_vma_page() is performed only
      upon success.  If the page lock cannot be obtained, the page is left
      without PG_mlocked, which is again not a problem in the whole unevictable
      memory design.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Cc: Yijing Wang <wangyijing@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21dbd45f
    • Boris Ostrovsky's avatar
      x86/espfix/xen: Fix allocation of pages for paravirt page tables · 9c982680
      Boris Ostrovsky authored
      commit 8762e509 upstream.
      
      init_espfix_ap() is currently off by one level when informing hypervisor
      that allocated pages will be used for ministacks' page tables.
      
      The most immediate effect of this on a PV guest is that if
      'stack_page = __get_free_page()' returns a non-zeroed-out page the hypervisor
      will refuse to use it for a page table (which it shouldn't be anyway). This will
      result in warnings by both Xen and Linux.
      
      More importantly, a subsequent write to that page (again, by a PV guest) is
      likely to result in fatal page fault.
      Signed-off-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Link: http://lkml.kernel.org/r/1404926298-5565-1-git-send-email-boris.ostrovsky@oracle.comReviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c982680
    • Minfei Huang's avatar
      lib/btree.c: fix leak of whole btree nodes · cca69958
      Minfei Huang authored
      commit c75b53af upstream.
      
      I use btree from 3.14-rc2 in my own module.  When the btree module is
      removed, a warning arises:
      
       kmem_cache_destroy btree_node: Slab cache still has objects
       CPU: 13 PID: 9150 Comm: rmmod Tainted: GF          O 3.14.0-rc2 #1
       Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013
       Call Trace:
         dump_stack+0x49/0x5d
         kmem_cache_destroy+0xcf/0xe0
         btree_module_exit+0x10/0x12 [btree]
         SyS_delete_module+0x198/0x1f0
         system_call_fastpath+0x16/0x1b
      
      The cause is that it doesn't release the last btree node, when height = 1
      and fill = 1.
      
      [akpm@linux-foundation.org: remove unneeded test of NULL]
      Signed-off-by: default avatarMinfei Huang <huangminfei@ucloud.cn>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cca69958
    • Sasha Levin's avatar
      net/l2tp: don't fall back on UDP [get|set]sockopt · 56a2a6f6
      Sasha Levin authored
      commit 3cf521f7 upstream.
      
      The l2tp [get|set]sockopt() code has fallen back to the UDP functions
      for socket option levels != SOL_PPPOL2TP since day one, but that has
      never actually worked, since the l2tp socket isn't an inet socket.
      
      As David Miller points out:
      
        "If we wanted this to work, it'd have to look up the tunnel and then
         use tunnel->sk, but I wonder how useful that would be"
      
      Since this can never have worked so nobody could possibly have depended
      on that functionality, just remove the broken code and return -EINVAL.
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarJames Chapman <jchapman@katalix.com>
      Acked-by: default avatarDavid Miller <davem@davemloft.net>
      Cc: Phil Turnbull <phil.turnbull@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      56a2a6f6
    • Greg Kroah-Hartman's avatar
      Revert: "net: ip, ipv6: handle gso skbs in forwarding path" · 68199d62
      Greg Kroah-Hartman authored
      This reverts commit 29a3cd46 which is
      commit fe6cc55f upstream.
      
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68199d62
    • Andy Lutomirski's avatar
      x86_64/entry/xen: Do not invoke espfix64 on Xen · 473d8a23
      Andy Lutomirski authored
      commit 7209a75d upstream.
      
      This moves the espfix64 logic into native_iret.  To make this work,
      it gets rid of the native patch for INTERRUPT_RETURN:
      INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.
      
      This changes the 16-bit SS behavior on Xen from OOPSing to leaking
      some bits of the Xen hypervisor's RSP (I think).
      
      [ hpa: this is a nonzero cost on native, but probably not enough to
        measure. Xen needs to fix this in their own code, probably doing
        something equivalent to espfix64. ]
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.netSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      473d8a23
    • H. Peter Anvin's avatar
      x86, espfix: Make it possible to disable 16-bit support · 4a0db8af
      H. Peter Anvin authored
      commit 34273f41 upstream.
      
      Embedded systems, which may be very memory-size-sensitive, are
      extremely unlikely to ever encounter any 16-bit software, so make it
      a CONFIG_EXPERT option to turn off support for any 16-bit software
      whatsoever.
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a0db8af
    • H. Peter Anvin's avatar
      x86, espfix: Make espfix64 a Kconfig option, fix UML · 7b6354ea
      H. Peter Anvin authored
      commit 197725de upstream.
      
      Make espfix64 a hidden Kconfig option.  This fixes the x86-64 UML
      build which had broken due to the non-existence of init_espfix_bsp()
      in UML: since UML uses its own Kconfig, this option does not appear in
      the UML build.
      
      This also makes it possible to make support for 16-bit segments a
      configuration option, for the people who want to minimize the size of
      the kernel.
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Richard Weinberger <richard@nod.at>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b6354ea
    • H. Peter Anvin's avatar
      x86, espfix: Fix broken header guard · 111f0ce4
      H. Peter Anvin authored
      commit 20b68535 upstream.
      
      Header guard is #ifndef, not #ifdef...
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      111f0ce4
    • H. Peter Anvin's avatar
      x86, espfix: Move espfix definitions into a separate header file · 18e63ea3
      H. Peter Anvin authored
      commit e1fe9ed8 upstream.
      
      Sparse warns that the percpu variables aren't declared before they are
      defined.  Rather than hacking around it, move espfix definitions into
      a proper header file.
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18e63ea3
    • H. Peter Anvin's avatar
      x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack · 3989298c
      H. Peter Anvin authored
      commit 3891a04a upstream.
      
      The IRET instruction, when returning to a 16-bit segment, only
      restores the bottom 16 bits of the user space stack pointer.  This
      causes some 16-bit software to break, but it also leaks kernel state
      to user space.  We have a software workaround for that ("espfix") for
      the 32-bit kernel, but it relies on a nonzero stack segment base which
      is not available in 64-bit mode.
      
      In checkin:
      
          b3b42ac2 x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
      
      we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
      the logic that 16-bit support is crippled on 64-bit kernels anyway (no
      V86 support), but it turns out that people are doing stuff like
      running old Win16 binaries under Wine and expect it to work.
      
      This works around this by creating percpu "ministacks", each of which
      is mapped 2^16 times 64K apart.  When we detect that the return SS is
      on the LDT, we copy the IRET frame to the ministack and use the
      relevant alias to return to userspace.  The ministacks are mapped
      readonly, so if IRET faults we promote #GP to #DF which is an IST
      vector and thus has its own stack; we then do the fixup in the #DF
      handler.
      
      (Making #GP an IST exception would make the msr_safe functions unsafe
      in NMI/MC context, and quite possibly have other effects.)
      
      Special thanks to:
      
      - Andy Lutomirski, for the suggestion of using very small stack slots
        and copy (as opposed to map) the IRET frame there, and for the
        suggestion to mark them readonly and let the fault promote to #DF.
      - Konrad Wilk for paravirt fixup and testing.
      - Borislav Petkov for testing help and useful comments.
      Reported-by: default avatarBrian Gerst <brgerst@gmail.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrew Lutomriski <amluto@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Dirk Hohndel <dirk@hohndel.org>
      Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
      Cc: comex <comexk@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3989298c
    • H. Peter Anvin's avatar
      Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option" · 6c7e43ad
      H. Peter Anvin authored
      commit 7ed6fb9b upstream.
      
      This reverts commit fa81511b in
      preparation of merging in the proper fix (espfix64).
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6c7e43ad
    • Jan Kara's avatar
      timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks · fbbb7208
      Jan Kara authored
      commit 504d5874 upstream.
      
      clockevents_increase_min_delta() calls printk() from under
      hrtimer_bases.lock. That causes lock inversion on scheduler locks because
      printk() can call into the scheduler. Lockdep puts it as:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      3.15.0-rc8-06195-g939f04be #2 Not tainted
      -------------------------------------------------------
      trinity-main/74 is trying to acquire lock:
       (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c
      
      but task is already holding lock:
       (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (hrtimer_bases.lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
             [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
             [<81080792>] task_clock_event_start+0x3a/0x3f
             [<810807a4>] task_clock_event_add+0xd/0x14
             [<8108259a>] event_sched_in+0xb6/0x17a
             [<810826a2>] group_sched_in+0x44/0x122
             [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
             [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
             [<81082bf6>] __perf_install_in_context+0x8b/0xa3
             [<8107eb8e>] remote_function+0x12/0x2a
             [<8105f5af>] smp_call_function_single+0x2d/0x53
             [<8107e17d>] task_function_call+0x30/0x36
             [<8107fb82>] perf_install_in_context+0x87/0xbb
             [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
             [<810856f9>] SyS_perf_event_open+0x17/0x19
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #4 (&ctx->lock){......}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      -> #3 (&rq->lock){-.-.-.}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f04c>] _raw_spin_lock+0x21/0x30
             [<81040873>] __task_rq_lock+0x33/0x3a
             [<8104184c>] wake_up_new_task+0x25/0xc2
             [<8102474b>] do_fork+0x15c/0x2a0
             [<810248a9>] kernel_thread+0x1a/0x1f
             [<814232a2>] rest_init+0x1a/0x10e
             [<817af949>] start_kernel+0x303/0x308
             [<817af2ab>] i386_start_kernel+0x79/0x7d
      
      -> #2 (&p->pi_lock){-.-...}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<810413dd>] try_to_wake_up+0x1d/0xd6
             [<810414cd>] default_wake_function+0xb/0xd
             [<810461f3>] __wake_up_common+0x39/0x59
             [<81046346>] __wake_up+0x29/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #1 (&tty->write_wait){-.....}:
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<81046332>] __wake_up+0x15/0x3b
             [<811b8733>] tty_wakeup+0x49/0x51
             [<811c3568>] uart_write_wakeup+0x17/0x19
             [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
             [<811c5f28>] serial8250_handle_irq+0x54/0x6a
             [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
             [<811c56d8>] serial8250_interrupt+0x38/0x9e
             [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
             [<81051296>] handle_irq_event+0x2c/0x43
             [<81052cee>] handle_level_irq+0x57/0x80
             [<81002a72>] handle_irq+0x46/0x5c
             [<810027df>] do_IRQ+0x32/0x89
             [<8143036e>] common_interrupt+0x2e/0x33
             [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
             [<811c25a4>] uart_start+0x2d/0x32
             [<811c2c04>] uart_write+0xc7/0xd6
             [<811bc6f6>] n_tty_write+0xb8/0x35e
             [<811b9beb>] tty_write+0x163/0x1e4
             [<811b9cd9>] redirected_tty_write+0x6d/0x75
             [<810b6ed6>] vfs_write+0x75/0xb0
             [<810b7265>] SyS_write+0x44/0x77
             [<8142f8ee>] syscall_call+0x7/0xb
      
      -> #0 (&port_lock_key){-.....}:
             [<8104a62d>] __lock_acquire+0x9ea/0xc6d
             [<8104a942>] lock_acquire+0x92/0x101
             [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
             [<811c60be>] serial8250_console_write+0x8c/0x10c
             [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
             [<8104f5d5>] console_unlock+0x1d7/0x398
             [<8104fb70>] vprintk_emit+0x3da/0x3e4
             [<81425f76>] printk+0x17/0x19
             [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
             [<8105c548>] clockevents_program_event+0xe7/0xf3
             [<8105cc1c>] tick_program_event+0x1e/0x23
             [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
             [<8103c49e>] __remove_hrtimer+0x5b/0x79
             [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
             [<8103cb4b>] hrtimer_cancel+0xd/0x18
             [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
             [<81080705>] task_clock_event_stop+0x20/0x64
             [<81080756>] task_clock_event_del+0xd/0xf
             [<81081350>] event_sched_out+0xab/0x11e
             [<810813e0>] group_sched_out+0x1d/0x66
             [<81081682>] ctx_sched_out+0xaf/0xbf
             [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
             [<8142cacc>] __schedule+0x4c6/0x4cb
             [<8142cae0>] schedule+0xf/0x11
             [<8142f9a6>] work_resched+0x5/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &port_lock_key --> &ctx->lock --> hrtimer_bases.lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(hrtimer_bases.lock);
                                     lock(&ctx->lock);
                                     lock(hrtimer_bases.lock);
        lock(&port_lock_key);
      
       *** DEADLOCK ***
      
      4 locks held by trinity-main/74:
       #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
       #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
       #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4
      
      stack backtrace:
      CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04be #2
       00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
       8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
       8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
      Call Trace:
       [<81426f69>] dump_stack+0x16/0x18
       [<81425a99>] print_circular_bug+0x18f/0x19c
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] ? serial8250_console_write+0x8c/0x10c
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104af87>] ? lock_release+0x191/0x223
       [<811c6032>] ? wait_for_xmitr+0x76/0x76
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8104416d>] ? __dequeue_entity+0x23/0x27
       [<81044505>] ? pick_next_task_fair+0xb1/0x120
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
       [<810475b0>] ? trace_hardirqs_off+0xb/0xd
       [<81056346>] ? rcu_irq_exit+0x64/0x77
      
      Fix the problem by using printk_deferred() which does not call into the
      scheduler.
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbbb7208
    • John Stultz's avatar
      printk: rename printk_sched to printk_deferred · f73ff697
      John Stultz authored
      commit aac74dc4 upstream.
      
      After learning we'll need some sort of deferred printk functionality in
      the timekeeping core, Peter suggested we rename the printk_sched function
      so it can be reused by needed subsystems.
      
      This only changes the function name. No logic changes.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f73ff697
    • David Rientjes's avatar
      mm, thp: do not allow thp faults to avoid cpuset restrictions · 7f6c1deb
      David Rientjes authored
      commit b104a35d upstream.
      
      The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
      should be set in allocflags.  ALLOC_CPUSET controls if a page allocation
      should be restricted only to the set of allowed cpuset mems.
      
      Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
      the fault path from using memory compaction or direct reclaim.  Thus, it
      is unfairly able to allocate outside of its cpuset mems restriction as a
      side-effect.
      
      This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
      truly GFP_ATOMIC by verifying it is also not a thp allocation.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Tested-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6c1deb
    • James Bottomley's avatar
      scsi: handle flush errors properly · 0e04ec4d
      James Bottomley authored
      commit 89fb4cd1 upstream.
      
      Flush commands don't transfer data and thus need to be special cased
      in the I/O completion handler so that we can propagate errors to
      the block layer and filesystem.
      Signed-off-by: default avatarJames Bottomley <JBottomley@Parallels.com>
      Reported-by: default avatarSteven Haber <steven@qumulo.com>
      Tested-by: default avatarSteven Haber <steven@qumulo.com>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e04ec4d
    • Konstantin Khlebnikov's avatar
      ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout · 2e926a6b
      Konstantin Khlebnikov authored
      commit 811a2407 upstream.
      
      On LPAE, each level 1 (pgd) page table entry maps 1GiB, and the level 2
      (pmd) entries map 2MiB.
      
      When the identity mapping is created on LPAE, the pgd pointers are copied
      from the swapper_pg_dir.  If we find that we need to modify the contents
      of a pmd, we allocate a new empty pmd table and insert it into the
      appropriate 1GB slot, before then filling it with the identity mapping.
      
      However, if the 1GB slot covers the kernel lowmem mappings, we obliterate
      those mappings.
      
      When replacing a PMD, first copy the old PMD contents to the new PMD, so
      that we preserve the existing mappings, particularly the mappings of the
      kernel itself.
      
      [rewrote commit message and added code comment -- rmk]
      
      Fixes: ae2de101 ("ARM: LPAE: Add identity mapping support for the 3-level page table format")
      Signed-off-by: default avatarKonstantin Khlebnikov <k.khlebnikov@samsung.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e926a6b
    • Milan Broz's avatar
      crypto: af_alg - properly label AF_ALG socket · 44ce8f87
      Milan Broz authored
      commit 4c63f83c upstream.
      
      Th AF_ALG socket was missing a security label (e.g. SELinux)
      which means that socket was in "unlabeled" state.
      
      This was recently demonstrated in the cryptsetup package
      (cryptsetup v1.6.5 and later.)
      See https://bugzilla.redhat.com/show_bug.cgi?id=1115120
      
      This patch clones the sock's label from the parent sock
      and resolves the issue (similar to AF_BLUETOOTH protocol family).
      Signed-off-by: default avatarMilan Broz <gmazyland@gmail.com>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44ce8f87
  3. 31 Jul, 2014 11 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.4.101 · 91f7c8cb
      Greg Kroah-Hartman authored
      91f7c8cb
    • Catalin Marinas's avatar
      mm: kmemleak: avoid false negatives on vmalloc'ed objects · 7e2e6118
      Catalin Marinas authored
      commit 7f88f88f upstream.
      
      Commit 248ac0e1 ("mm/vmalloc: remove guard page from between vmap
      blocks") had the side effect of making vmap_area.va_end member point to
      the next vmap_area.va_start.  This was creating an artificial reference
      to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.
      
      This patch marks the vmap_area containing pointers explicitly and
      reduces the min ref_count to 2 as vm_struct still contains a reference
      to the vmalloc'ed object.  The kmemleak add_scan_area() function has
      been improved to allow a SIZE_MAX argument covering the rest of the
      object (for simpler calling sites).
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [hq: Backported to 3.4: Adjust context]
      Signed-off-by: default avatarQiang Huang <h.huangqiang@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      7e2e6118
    • Xi Wang's avatar
      introduce SIZE_MAX · b06b5c62
      Xi Wang authored
      commit a3860c1c upstream.
      
      ULONG_MAX is often used to check for integer overflow when calculating
      allocation size.  While ULONG_MAX happens to work on most systems, there
      is no guarantee that `size_t' must be the same size as `long'.
      
      This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
      portability and readability for allocation size validation.
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Acked-by: default avatarAlex Elder <elder@dreamhost.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Qiang Huang <h.huangqiang@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b06b5c62
    • Martin Schwidefsky's avatar
      s390/ptrace: fix PSW mask check · 883ea134
      Martin Schwidefsky authored
      commit dab6cf55 upstream.
      
      The PSW mask check of the PTRACE_POKEUSR_AREA command is incorrect.
      The PSW_MASK_USER define contains the PSW_MASK_ASC bits, the ptrace
      interface accepts all combinations for the address-space-control
      bits. To protect the kernel space the PSW mask check in ptrace needs
      to reject the address-space-control bit combination for home space.
      
      Fixes CVE-2014-3534
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      883ea134
    • Linus Torvalds's avatar
      Fix gcc-4.9.0 miscompilation of load_balance() in scheduler · e5c662d1
      Linus Torvalds authored
      commit 2062afb4 upstream.
      
      Michel Dänzer and a couple of other people reported inexplicable random
      oopses in the scheduler, and the cause turns out to be gcc mis-compiling
      the load_balance() function when debugging is enabled.  The gcc bug
      apparently goes back to gcc-4.5, but slight optimization changes means
      that it now showed up as a problem in 4.9.0 and 4.9.1.
      
      The instruction scheduling problem causes gcc to schedule a spill
      operation to before the stack frame has been created, which in turn can
      corrupt the spilled value if an interrupt comes in.  There may be other
      effects of this bug too, but that's the code generation problem seen in
      Michel's case.
      
      This is fixed in current gcc HEAD, but the workaround as suggested by
      Markus Trippelsdorf is pretty simple: use -fno-var-tracking-assignments
      when compiling the kernel, which disables the gcc code that causes the
      problem.  This can result in slightly worse debug information for
      variable accesses, but that is infinitely preferable to actual code
      generation problems.
      
      Doing this unconditionally (not just for CONFIG_DEBUG_INFO) also allows
      non-debug builds to verify that the debug build would be identical: we
      can do
      
          export GCC_COMPARE_DEBUG=1
      
      to make gcc internally verify that the result of the build is
      independent of the "-g" flag (it will make the compiler build everything
      twice, toggling the debug flag, and compare the results).
      
      Without the "-fno-var-tracking-assignments" option, the build would fail
      (even with 4.8.3 that didn't show the actual stack frame bug) with a gcc
      compare failure.
      
      See also gcc bugzilla:
      
        https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61801Reported-by: default avatarMichel Dänzer <michel@daenzer.net>
      Suggested-by: default avatarMarkus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5c662d1
    • Naoya Horiguchi's avatar
      mm: hugetlb: fix copy_hugetlb_page_range() · 97f5dabb
      Naoya Horiguchi authored
      commit 0253d634 upstream.
      
      Commit 4a705fef ("hugetlb: fix copy_hugetlb_page_range() to handle
      migration/hwpoisoned entry") changed the order of
      huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
      in some workloads like hugepage-backed heap allocation via libhugetlbfs.
      This patch fixes it.
      
      The test program for the problem is shown below:
      
        $ cat heap.c
        #include <unistd.h>
        #include <stdlib.h>
        #include <string.h>
      
        #define HPS 0x200000
      
        int main() {
        	int i;
        	char *p = malloc(HPS);
        	memset(p, '1', HPS);
        	for (i = 0; i < 5; i++) {
        		if (!fork()) {
        			memset(p, '2', HPS);
        			p = malloc(HPS);
        			memset(p, '3', HPS);
        			free(p);
        			return 0;
        		}
        	}
        	sleep(1);
        	free(p);
        	return 0;
        }
      
        $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
      
      Fixes 4a705fef ("hugetlb: fix copy_hugetlb_page_range() to handle
      migration/hwpoisoned entry"), so is applicable to -stable kernels which
      include it.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: default avatarGuillaume Morin <guillaume@morinfr.org>
      Suggested-by: default avatarGuillaume Morin <guillaume@morinfr.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97f5dabb
    • Sven Wegener's avatar
      x86_32, entry: Store badsys error code in %eax · 4aedd4b0
      Sven Wegener authored
      commit 8142b215 upstream.
      
      Commit 554086d8 ("x86_32, entry: Do syscall exit work on badsys
      (CVE-2014-4508)") introduced a regression in the x86_32 syscall entry
      code, resulting in syscall() not returning proper errors for undefined
      syscalls on CPUs supporting the sysenter feature.
      
      The following code:
      
      > int result = syscall(666);
      > printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));
      
      results in:
      
      > result=666 errno=0 error=Success
      
      Obviously, the syscall return value is the called syscall number, but it
      should have been an ENOSYS error. When run under ptrace it behaves
      correctly, which makes it hard to debug in the wild:
      
      > result=-1 errno=38 error=Function not implemented
      
      The %eax register is the return value register. For debugging via ptrace
      the syscall entry code stores the complete register context on the
      stack. The badsys handlers only store the ENOSYS error code in the
      ptrace register set and do not set %eax like a regular syscall handler
      would. The old resume_userspace call chain contains code that clobbers
      %eax and it restores %eax from the ptrace registers afterwards. The same
      goes for the ptrace-enabled call chain. When ptrace is not used, the
      syscall return value is the passed-in syscall number from the untouched
      %eax register.
      
      Use %eax as the return value register in syscall_badsys and
      sysenter_badsys, like a real syscall handler does, and have the caller
      push the value onto the stack for ptrace access.
      Signed-off-by: default avatarSven Wegener <sven.wegener@stealer.net>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.netReviewed-and-tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4aedd4b0
    • Romain Degez's avatar
      ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode) · 46005527
      Romain Degez authored
      commit b32bfc06 upstream.
      
      Add support of the Promise FastTrak TX8660 SATA HBA in ahci mode by
      registering the board in the ahci_pci_tbl[].
      
      Note: this HBA also provide a hardware RAID mode when activated in
      BIOS but specific drivers from the manufacturer are required in this
      case.
      Signed-off-by: default avatarRomain Degez <romain.degez@gmail.com>
      Tested-by: default avatarRomain Degez <romain.degez@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      46005527
    • Tejun Heo's avatar
      libata: introduce ata_host->n_tags to avoid oops on SAS controllers · 80cd492c
      Tejun Heo authored
      commit 1a112d10 upstream.
      
      1871ee13 ("libata: support the ata host which implements a queue
      depth less than 32") directly used ata_port->scsi_host->can_queue from
      ata_qc_new() to determine the number of tags supported by the host;
      unfortunately, SAS controllers doing SATA don't initialize ->scsi_host
      leading to the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
       IP: [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       PGD 0
       Oops: 0002 [#1] SMP
       Modules linked in: isci libsas scsi_transport_sas mgag200 drm_kms_helper ttm
       CPU: 1 PID: 518 Comm: udevd Not tainted 3.16.0-rc6+ #62
       Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       task: ffff880c1a00b280 ti: ffff88061a000000 task.ti: ffff88061a000000
       RIP: 0010:[<ffffffff814e0618>]  [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
       RSP: 0018:ffff88061a003ae8  EFLAGS: 00010012
       RAX: 0000000000000001 RBX: ffff88000241ca80 RCX: 00000000000000fa
       RDX: 0000000000000020 RSI: 0000000000000020 RDI: ffff8806194aa298
       RBP: ffff88061a003ae8 R08: ffff8806194a8000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffff88000241ca80 R12: ffff88061ad58200
       R13: ffff8806194aa298 R14: ffffffff814e67a0 R15: ffff8806194a8000
       FS:  00007f3ad7fe3840(0000) GS:ffff880627620000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000058 CR3: 000000061a118000 CR4: 00000000001407e0
       Stack:
        ffff88061a003b20 ffffffff814e96e1 ffff88000241ca80 ffff88061ad58200
        ffff8800b6bf6000 ffff880c1c988000 ffff880619903850 ffff88061a003b68
        ffffffffa0056ce1 ffff88061a003b48 0000000013d6e6f8 ffff88000241ca80
       Call Trace:
        [<ffffffff814e96e1>] ata_sas_queuecmd+0xa1/0x430
        [<ffffffffa0056ce1>] sas_queuecommand+0x191/0x220 [libsas]
        [<ffffffff8149afee>] scsi_dispatch_cmd+0x10e/0x300 [<ffffffff814a3bc5>] scsi_request_fn+0x2f5/0x550
        [<ffffffff81317613>] __blk_run_queue+0x33/0x40
        [<ffffffff8131781a>] queue_unplugged+0x2a/0x90
        [<ffffffff8131ceb4>] blk_flush_plug_list+0x1b4/0x210
        [<ffffffff8131d274>] blk_finish_plug+0x14/0x50
        [<ffffffff8117eaa8>] __do_page_cache_readahead+0x198/0x1f0
        [<ffffffff8117ee21>] force_page_cache_readahead+0x31/0x50
        [<ffffffff8117ee7e>] page_cache_sync_readahead+0x3e/0x50
        [<ffffffff81172ac6>] generic_file_read_iter+0x496/0x5a0
        [<ffffffff81219897>] blkdev_read_iter+0x37/0x40
        [<ffffffff811e307e>] new_sync_read+0x7e/0xb0
        [<ffffffff811e3734>] vfs_read+0x94/0x170
        [<ffffffff811e43c6>] SyS_read+0x46/0xb0
        [<ffffffff811e33d1>] ? SyS_lseek+0x91/0xb0
        [<ffffffff8171ee29>] system_call_fastpath+0x16/0x1b
       Code: 00 00 00 88 50 29 83 7f 08 01 19 d2 83 e2 f0 83 ea 50 88 50 34 c6 81 1d 02 00 00 40 c6 81 17 02 00 00 00 5d c3 66 0f 1f 44 00 00 <89> 14 25 58 00 00 00
      
      Fix it by introducing ata_host->n_tags which is initialized to
      ATA_MAX_QUEUE - 1 in ata_host_init() for SAS controllers and set to
      scsi_host_template->can_queue in ata_host_register() for !SAS ones.
      As SAS hosts are never registered, this will give them the same
      ATA_MAX_QUEUE - 1 as before.  Note that we can't use
      scsi_host->can_queue directly for SAS hosts anyway as they can go
      higher than the libata maximum.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMike Qiu <qiudayu@linux.vnet.ibm.com>
      Reported-by: default avatarJesse Brandeburg <jesse.brandeburg@gmail.com>
      Reported-by: default avatarPeter Hurley <peter@hurleysoftware.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Tested-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Fixes: 1871ee13 ("libata: support the ata host which implements a queue depth less than 32")
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      80cd492c
    • Kevin Hao's avatar
      libata: support the ata host which implements a queue depth less than 32 · 6f0844c4
      Kevin Hao authored
      commit 1871ee13 upstream.
      
      The sata on fsl mpc8315e is broken after the commit 8a4aeec8
      ("libata/ahci: accommodate tag ordered controllers"). The reason is
      that the ata controller on this SoC only implement a queue depth of
      16. When issuing the commands in tag order, all the commands in tag
      16 ~ 31 are mapped to tag 0 unconditionally and then causes the sata
      malfunction. It makes no senses to use a 32 queue in software while
      the hardware has less queue depth. So consider the queue depth
      implemented by the hardware when requesting a command tag.
      
      Fixes: 8a4aeec8 ("libata/ahci: accommodate tag ordered controllers")
      Signed-off-by: default avatarKevin Hao <haokexin@gmail.com>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f0844c4
    • Christoph Hellwig's avatar
      block: don't assume last put of shared tags is for the host · 696acd9d
      Christoph Hellwig authored
      commit d45b3279 upstream.
      
      There is no inherent reason why the last put of a tag structure must be
      the one for the Scsi_Host, as device model objects can be held for
      arbitrary periods.  Merge blk_free_tags and __blk_free_tags into a single
      funtion that just release a references and get rid of the BUG() when the
      host reference wasn't the last.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      696acd9d
  4. 28 Jul, 2014 7 commits