1. 20 Oct, 2015 14 commits
  2. 13 Oct, 2015 4 commits
    • Marcelo Ricardo Leitner's avatar
      sctp: fix race on protocol/netns initialization · 75113254
      Marcelo Ricardo Leitner authored
      commit 8e2d61e0 upstream.
      
      Consider sctp module is unloaded and is being requested because an user
      is creating a sctp socket.
      
      During initialization, sctp will add the new protocol type and then
      initialize pernet subsys:
      
              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;
      
              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;
      
              status = register_pernet_subsys(&sctp_net_ops);
      
      The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
      is possible for userspace to create SCTP sockets like if the module is
      already fully loaded. If that happens, one of the possible effects is
      that we will have readers for net->sctp.local_addr_list list earlier
      than expected and sctp_net_init() does not take precautions while
      dealing with that list, leading to a potential panic but not limited to
      that, as sctp_sock_init() will copy a bunch of blank/partially
      initialized values from net->sctp.
      
      The race happens like this:
      
           CPU 0                           |  CPU 1
        socket()                           |
         __sock_create                     | socket()
          inet_create                      |  __sock_create
           list_for_each_entry_rcu(        |
              answer, &inetsw[sock->type], |
              list) {                      |   inet_create
            /* no hits */                  |
           if (unlikely(err)) {            |
            ...                            |
            request_module()               |
            /* socket creation is blocked  |
             * the module is fully loaded  |
             */                            |
             sctp_init                     |
              sctp_v4_protosw_init         |
               inet_register_protosw       |
                list_add_rcu(&p->list,     |
                             last_perm);   |
                                           |  list_for_each_entry_rcu(
                                           |     answer, &inetsw[sock->type],
              sctp_v6_protosw_init         |     list) {
                                           |     /* hit, so assumes protocol
                                           |      * is already loaded
                                           |      */
                                           |  /* socket creation continues
                                           |   * before netns is initialized
                                           |   */
              register_pernet_subsys       |
      
      Simply inverting the initialization order between
      register_pernet_subsys() and sctp_v4_protosw_init() is not possible
      because register_pernet_subsys() will create a control sctp socket, so
      the protocol must be already visible by then. Deferring the socket
      creation to a work-queue is not good specially because we loose the
      ability to handle its errors.
      
      So, as suggested by Vlad, the fix is to split netns initialization in
      two moments: defaults and control socket, so that the defaults are
      already loaded by when we register the protocol, while control socket
      initialization is kept at the same moment it is today.
      
      Fixes: 4db67e80 ("sctp: Make the address lists per network namespace")
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      75113254
    • Eric W. Biederman's avatar
      vfs: Test for and handle paths that are unreachable from their mnt_root · 29da259e
      Eric W. Biederman authored
      commit 397d425d upstream.
      
      In rare cases a directory can be renamed out from under a bind mount.
      In those cases without special handling it becomes possible to walk up
      the directory tree to the root dentry of the filesystem and down
      from the root dentry to every other file or directory on the filesystem.
      
      Like division by zero .. from an unconnected path can not be given
      a useful semantic as there is no predicting at which path component
      the code will realize it is unconnected.  We certainly can not match
      the current behavior as the current behavior is a security hole.
      
      Therefore when encounting .. when following an unconnected path
      return -ENOENT.
      
      - Add a function path_connected to verify path->dentry is reachable
        from path->mnt.mnt_root.  AKA to validate that rename did not do
        something nasty to the bind mount.
      
        To avoid races path_connected must be called after following a path
        component to it's next path component.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      29da259e
    • Eric W. Biederman's avatar
      dcache: Handle escaped paths in prepend_path · df50ffca
      Eric W. Biederman authored
      commit cde93be4 upstream.
      
      A rename can result in a dentry that by walking up d_parent
      will never reach it's mnt_root.  For lack of a better term
      I call this an escaped path.
      
      prepend_path is called by four different functions __d_path,
      d_absolute_path, d_path, and getcwd.
      
      __d_path only wants to see paths are connected to the root it passes
      in.  So __d_path needs prepend_path to return an error.
      
      d_absolute_path similarly wants to see paths that are connected to
      some root.  Escaped paths are not connected to any mnt_root so
      d_absolute_path needs prepend_path to return an error greater
      than 1.  So escaped paths will be treated like paths on lazily
      unmounted mounts.
      
      getcwd needs to prepend "(unreachable)" so getcwd also needs
      prepend_path to return an error.
      
      d_path is the interesting hold out.  d_path just wants to print
      something, and does not care about the weird cases.  Which raises
      the question what should be printed?
      
      Given that <escaped_path>/<anything> should result in -ENOENT I
      believe it is desirable for escaped paths to be printed as empty
      paths.  As there are not really any meaninful path components when
      considered from the perspective of a mount tree.
      
      So tweak prepend_path to return an empty path with an new error
      code of 3 when it encounters an escaped path.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      df50ffca
    • Johan Hovold's avatar
      USB: whiteheat: fix potential null-deref at probe · 28f76f92
      Johan Hovold authored
      commit cbb4be65 upstream.
      
      Fix potential null-pointer dereference at probe by making sure that the
      required endpoints are present.
      
      The whiteheat driver assumes there are at least five pairs of bulk
      endpoints, of which the final pair is used for the "command port". An
      attempt to bind to an interface with fewer bulk endpoints would
      currently lead to an oops.
      
      Fixes CVE-2015-5257.
      Reported-by: default avatarMoein Ghasemzadeh <moein@istuary.com>
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      28f76f92
  3. 25 Sep, 2015 1 commit
  4. 22 Sep, 2015 8 commits
  5. 21 Sep, 2015 13 commits
    • David S. Miller's avatar
      sparc64: Fix userspace FPU register corruptions. · 31de7bfa
      David S. Miller authored
      [ Upstream commit 44922150 ]
      
      If we have a series of events from userpsace, with %fprs=FPRS_FEF,
      like follows:
      
      ETRAP
      	ETRAP
      		VIS_ENTRY(fprs=0x4)
      		VIS_EXIT
      		RTRAP (kernel FPU restore with fpu_saved=0x4)
      	RTRAP
      
      We will not restore the user registers that were clobbered by the FPU
      using kernel code in the inner-most trap.
      
      Traps allocate FPU save slots in the thread struct, and FPU using
      sequences save the "dirty" FPU registers only.
      
      This works at the initial trap level because all of the registers
      get recorded into the top-level FPU save area, and we'll return
      to userspace with the FPU disabled so that any FPU use by the user
      will take an FPU disabled trap wherein we'll load the registers
      back up properly.
      
      But this is not how trap returns from kernel to kernel operate.
      
      The simplest fix for this bug is to always save all FPU register state
      for anything other than the top-most FPU save area.
      
      Getting rid of the optimized inner-slot FPU saving code ends up
      making VISEntryHalf degenerate into plain VISEntry.
      
      Longer term we need to do something smarter to reinstate the partial
      save optimizations.  Perhaps the fundament error is having trap entry
      and exit allocate FPU save slots and restore register state.  Instead,
      the VISEntry et al. calls should be doing that work.
      
      This bug is about two decades old.
      Reported-by: default avatarJames Y Knight <jyknight@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      31de7bfa
    • Eric Dumazet's avatar
      udp: fix dst races with multicast early demux · 9625566c
      Eric Dumazet authored
      commit 10e2eb87 upstream.
      
      Multicast dst are not cached. They carry DST_NOCACHE.
      
      As mentioned in commit f8864972 ("ipv4: fix dst race in
      sk_dst_get()"), these dst need special care before caching them
      into a socket.
      
      Caching them is allowed only if their refcnt was not 0, ie we
      must use atomic_inc_not_zero()
      
      Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
      in commit d0c294c5 ("tcp: prevent fetching dst twice in early demux
      code")
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Tested-by: default avatarGregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarGregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Reported-by: default avatarAlex Gartrell <agartrell@fb.com>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ luis: backported to 3.16: used davem's backport to 3.14 ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      9625566c
    • Dan Carpenter's avatar
      rds: fix an integer overflow test in rds_info_getsockopt() · 950a7a69
      Dan Carpenter authored
      commit 468b732b upstream.
      
      "len" is a signed integer.  We check that len is not negative, so it
      goes from zero to INT_MAX.  PAGE_SIZE is unsigned long so the comparison
      is type promoted to unsigned long.  ULONG_MAX - 4095 is a higher than
      INT_MAX so the condition can never be true.
      
      I don't know if this is harmful but it seems safe to limit "len" to
      INT_MAX - 4095.
      
      Fixes: a8c879a7 ('RDS: Info and stats')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      950a7a69
    • Herbert Xu's avatar
      net: Fix skb_set_peeked use-after-free bug · 1f61c92c
      Herbert Xu authored
      commit a0a2a660 upstream.
      
      The commit 738ac1eb ("net: Clone
      skb before setting peeked flag") introduced a use-after-free bug
      in skb_recv_datagram.  This is because skb_set_peeked may create
      a new skb and free the existing one.  As it stands the caller will
      continue to use the old freed skb.
      
      This patch fixes it by making skb_set_peeked return the new skb
      (or the old one if unchanged).
      
      Fixes: 738ac1eb ("net: Clone skb before setting peeked flag")
      Reported-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Reviewed-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      1f61c92c
    • David Ahern's avatar
      net: Fix RCU splat in af_key · 815c0610
      David Ahern authored
      commit ba51b6be upstream.
      
      Hit the following splat testing VRF change for ipsec:
      
      [  113.475692] ===============================
      [  113.476194] [ INFO: suspicious RCU usage. ]
      [  113.476667] 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED Not tainted
      [  113.477545] -------------------------------
      [  113.478013] /work/monster-14/dsa/kernel.git/include/linux/rcupdate.h:568 Illegal context switch in RCU read-side critical section!
      [  113.479288]
      [  113.479288] other info that might help us debug this:
      [  113.479288]
      [  113.480207]
      [  113.480207] rcu_scheduler_active = 1, debug_locks = 1
      [  113.480931] 2 locks held by setkey/6829:
      [  113.481371]  #0:  (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [<ffffffff814e9887>] pfkey_sendmsg+0xfb/0x213
      [  113.482509]  #1:  (rcu_read_lock){......}, at: [<ffffffff814e767f>] rcu_read_lock+0x0/0x6e
      [  113.483509]
      [  113.483509] stack backtrace:
      [  113.484041] CPU: 0 PID: 6829 Comm: setkey Not tainted 4.2.0-rc6-1+deb7u2+clUNRELEASED #3.2.65-1+deb7u2+clUNRELEASED
      [  113.485422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
      [  113.486845]  0000000000000001 ffff88001d4c7a98 ffffffff81518af2 ffffffff81086962
      [  113.487732]  ffff88001d538480 ffff88001d4c7ac8 ffffffff8107ae75 ffffffff8180a154
      [  113.488628]  0000000000000b30 0000000000000000 00000000000000d0 ffff88001d4c7ad8
      [  113.489525] Call Trace:
      [  113.489813]  [<ffffffff81518af2>] dump_stack+0x4c/0x65
      [  113.490389]  [<ffffffff81086962>] ? console_unlock+0x3d6/0x405
      [  113.491039]  [<ffffffff8107ae75>] lockdep_rcu_suspicious+0xfa/0x103
      [  113.491735]  [<ffffffff81064032>] rcu_preempt_sleep_check+0x45/0x47
      [  113.492442]  [<ffffffff8106404d>] ___might_sleep+0x19/0x1c8
      [  113.493077]  [<ffffffff81064268>] __might_sleep+0x6c/0x82
      [  113.493681]  [<ffffffff81133190>] cache_alloc_debugcheck_before.isra.50+0x1d/0x24
      [  113.494508]  [<ffffffff81134876>] kmem_cache_alloc+0x31/0x18f
      [  113.495149]  [<ffffffff814012b5>] skb_clone+0x64/0x80
      [  113.495712]  [<ffffffff814e6f71>] pfkey_broadcast_one+0x3d/0xff
      [  113.496380]  [<ffffffff814e7b84>] pfkey_broadcast+0xb5/0x11e
      [  113.497024]  [<ffffffff814e82d1>] pfkey_register+0x191/0x1b1
      [  113.497653]  [<ffffffff814e9770>] pfkey_process+0x162/0x17e
      [  113.498274]  [<ffffffff814e9895>] pfkey_sendmsg+0x109/0x213
      
      In pfkey_sendmsg the net mutex is taken and then pfkey_broadcast takes
      the RCU lock.
      
      Since pfkey_broadcast takes the RCU lock the allocation argument is
      pointless since GFP_ATOMIC must be used between the rcu_read_{,un}lock.
      The one call outside of rcu can be done with GFP_KERNEL.
      
      Fixes: 7f6b9dbd ("af_key: locking change")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      815c0610
    • huaibin Wang's avatar
      ip6_gre: release cached dst on tunnel removal · 794674db
      huaibin Wang authored
      commit d4257295 upstream.
      
      When a tunnel is deleted, the cached dst entry should be released.
      
      This problem may prevent the removal of a netns (seen with a x-netns IPv6
      gre tunnel):
        unregister_netdevice: waiting for lo to become free. Usage count = 3
      
      CC: Dmitry Kozlov <xeb@mail.ru>
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: default avatarhuaibin Wang <huaibin.wang@6wind.com>
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ kamal: backport to 3.13-stable ]
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      794674db
    • Marek Lindner's avatar
      batman-adv: protect tt_local_entry from concurrent delete events · ccc1ebfa
      Marek Lindner authored
      commit ef72706a upstream.
      
      The tt_local_entry deletion performed in batadv_tt_local_remove() was neither
      protecting against simultaneous deletes nor checking whether the element was
      still part of the list before calling hlist_del_rcu().
      
      Replacing the hlist_del_rcu() call with batadv_hash_remove() provides adequate
      protection via hash spinlocks as well as an is-element-still-in-hash check to
      avoid 'blind' hash removal.
      
      Fixes: 068ee6e2 ("batman-adv: roaming handling mechanism redesign")
      Reported-by: alfonsname@web.de
      Signed-off-by: default avatarMarek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: default avatarAntonio Quartulli <antonio@meshcoding.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      ccc1ebfa
    • Marc Zyngier's avatar
      arm64: KVM: Fix host crash when injecting a fault into a 32bit guest · a1631ba4
      Marc Zyngier authored
      commit 126c69a0 upstream.
      
      When injecting a fault into a misbehaving 32bit guest, it seems
      rather idiotic to also inject a 64bit fault that is only going
      to corrupt the guest state. This leads to a situation where we
      perform an illegal exception return at EL2 causing the host
      to crash instead of killing the guest.
      
      Just fix the stupid bug that has been there from day 1.
      Reported-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Tested-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      a1631ba4
    • Guillermo A. Amaral's avatar
      Add factory recertified Crucial M500s to blacklist · 1e4d0268
      Guillermo A. Amaral authored
      commit 7a7184b0 upstream.
      
      The Crucial M500 is known to have issues with queued TRIM commands, the
      factory recertified SSDs use a different model number naming convention
      which causes them to get ignored by the blacklist.
      
      The new naming convention boils down to: s/Crucial_/FC/
      Signed-off-by: default avatarGuillermo A. Amaral <g@maral.me>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      [ luis: backported to 3.16:
        - dropped ATA_HORKAGE_ZERO_AFTER_TRIM flag
        - adjusted context ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      1e4d0268
    • Manfred Spraul's avatar
      ipc/sem.c: update/correct memory barriers · da6c1a2f
      Manfred Spraul authored
      commit 3ed1f8a9 upstream.
      
      sem_lock() did not properly pair memory barriers:
      
      !spin_is_locked() and spin_unlock_wait() are both only control barriers.
      The code needs an acquire barrier, otherwise the cpu might perform read
      operations before the lock test.
      
      As no primitive exists inside <include/spinlock.h> and since it seems
      noone wants another primitive, the code creates a local primitive within
      ipc/sem.c.
      
      With regards to -stable:
      
      The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
      nop (just a preprocessor redefinition to improve the readability).  The
      bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
      starting from 3.10).
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      da6c1a2f
    • Manfred Spraul's avatar
      ipc/sem.c: change memory barrier in sem_lock() to smp_rmb() · 355b0f74
      Manfred Spraul authored
      commit 2e094abf upstream.
      
      When I fixed bugs in the sem_lock() logic, I was more conservative than
      necessary.  Therefore it is safe to replace the smp_mb() with smp_rmb().
      And: With smp_rmb(), semop() syscalls are up to 10% faster.
      
      The race we must protect against is:
      
      	sem->lock is free
      	sma->complex_count = 0
      	sma->sem_perm.lock held by thread B
      
      thread A:
      
      A: spin_lock(&sem->lock)
      
      			B: sma->complex_count++; (now 1)
      			B: spin_unlock(&sma->sem_perm.lock);
      
      A: spin_is_locked(&sma->sem_perm.lock);
      A: XXXXX memory barrier
      A: if (sma->complex_count == 0)
      
      Thread A must read the increased complex_count value, i.e. the read must
      not be reordered with the read of sem_perm.lock done by spin_is_locked().
      
      Since it's about ordering of reads, smp_rmb() is sufficient.
      
      [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Reviewed-by: default avatarDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [ luis: 3.16 prereq for:
        3ed1f8a9 "ipc/sem.c: update/correct memory barriers" ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      355b0f74
    • Herton R. Krzesinski's avatar
      ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits · a99220f8
      Herton R. Krzesinski authored
      commit 602b8593 upstream.
      
      The current semaphore code allows a potential use after free: in
      exit_sem we may free the task's sem_undo_list while there is still
      another task looping through the same semaphore set and cleaning the
      sem_undo list at freeary function (the task called IPC_RMID for the same
      semaphore set).
      
      For example, with a test program [1] running which keeps forking a lot
      of processes (which then do a semop call with SEM_UNDO flag), and with
      the parent right after removing the semaphore set with IPC_RMID, and a
      kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
      CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
      in the kernel log:
      
         Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
         000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b  kkkkkkkk.kkkkkkk
         010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
         Prev obj: start=ffff88003b45c180, len=64
         000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a  .....N......ZZZZ
         010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff  ...........7....
         Next obj: start=ffff88003b45c200, len=64
         000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a  .....N......ZZZZ
         010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff  ........h).<....
         BUG: spinlock wrong CPU on CPU#2, test/18028
         general protection fault: 0000 [#1] SMP
         Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
         CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         RIP: spin_dump+0x53/0xc0
         Call Trace:
           spin_bug+0x30/0x40
           do_raw_spin_unlock+0x71/0xa0
           _raw_spin_unlock+0xe/0x10
           freeary+0x82/0x2a0
           ? _raw_spin_lock+0xe/0x10
           semctl_down.clone.0+0xce/0x160
           ? __do_page_fault+0x19a/0x430
           ? __audit_syscall_entry+0xa8/0x100
           SyS_semctl+0x236/0x2c0
           ? syscall_trace_leave+0xde/0x130
           entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
         RIP  [<ffffffff810d6053>] spin_dump+0x53/0xc0
          RSP <ffff88003750fd68>
         ---[ end trace 783ebb76612867a0 ]---
         NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
         Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
         CPU: 3 PID: 18053 Comm: test Tainted: G      D         4.2.0-rc5+ #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         RIP: native_read_tsc+0x0/0x20
         Call Trace:
           ? delay_tsc+0x40/0x70
           __delay+0xf/0x20
           do_raw_spin_lock+0x96/0x140
           _raw_spin_lock+0xe/0x10
           sem_lock_and_putref+0x11/0x70
           SYSC_semtimedop+0x7bf/0x960
           ? handle_mm_fault+0xbf6/0x1880
           ? dequeue_task_fair+0x79/0x4a0
           ? __do_page_fault+0x19a/0x430
           ? kfree_debugcheck+0x16/0x40
           ? __do_page_fault+0x19a/0x430
           ? __audit_syscall_entry+0xa8/0x100
           ? do_audit_syscall_entry+0x66/0x70
           ? syscall_trace_enter_phase1+0x139/0x160
           SyS_semtimedop+0xe/0x10
           SyS_semop+0x10/0x20
           entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
         Kernel panic - not syncing: softlockup: hung tasks
      
      I wasn't able to trigger any badness on a recent kernel without the
      proper config debugs enabled, however I have softlockup reports on some
      kernel versions, in the semaphore code, which are similar as above (the
      scenario is seen on some servers running IBM DB2 which uses semaphore
      syscalls).
      
      The patch here fixes the race against freeary, by acquiring or waiting
      on the sem_undo_list lock as necessary (exit_sem can race with freeary,
      while freeary sets un->semid to -1 and removes the same sem_undo from
      list_proc or when it removes the last sem_undo).
      
      After the patch I'm unable to reproduce the problem using the test case
      [1].
      
      [1] Test case used below:
      
          #include <stdio.h>
          #include <sys/types.h>
          #include <sys/ipc.h>
          #include <sys/sem.h>
          #include <sys/wait.h>
          #include <stdlib.h>
          #include <time.h>
          #include <unistd.h>
          #include <errno.h>
      
          #define NSEM 1
          #define NSET 5
      
          int sid[NSET];
      
          void thread()
          {
                  struct sembuf op;
                  int s;
                  uid_t pid = getuid();
      
                  s = rand() % NSET;
                  op.sem_num = pid % NSEM;
                  op.sem_op = 1;
                  op.sem_flg = SEM_UNDO;
      
                  semop(sid[s], &op, 1);
                  exit(EXIT_SUCCESS);
          }
      
          void create_set()
          {
                  int i, j;
                  pid_t p;
                  union {
                          int val;
                          struct semid_ds *buf;
                          unsigned short int *array;
                          struct seminfo *__buf;
                  } un;
      
                  /* Create and initialize semaphore set */
                  for (i = 0; i < NSET; i++) {
                          sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
                          if (sid[i] < 0) {
                                  perror("semget");
                                  exit(EXIT_FAILURE);
                          }
                  }
                  un.val = 0;
                  for (i = 0; i < NSET; i++) {
                          for (j = 0; j < NSEM; j++) {
                                  if (semctl(sid[i], j, SETVAL, un) < 0)
                                          perror("semctl");
                          }
                  }
      
                  /* Launch threads that operate on semaphore set */
                  for (i = 0; i < NSEM * NSET * NSET; i++) {
                          p = fork();
                          if (p < 0)
                                  perror("fork");
                          if (p == 0)
                                  thread();
                  }
      
                  /* Free semaphore set */
                  for (i = 0; i < NSET; i++) {
                          if (semctl(sid[i], NSEM, IPC_RMID))
                                  perror("IPC_RMID");
                  }
      
                  /* Wait for forked processes to exit */
                  while (wait(NULL)) {
                          if (errno == ECHILD)
                                  break;
                  };
          }
      
          int main(int argc, char **argv)
          {
                  pid_t p;
      
                  srand(time(NULL));
      
                  while (1) {
                          p = fork();
                          if (p < 0) {
                                  perror("fork");
                                  exit(EXIT_FAILURE);
                          }
                          if (p == 0) {
                                  create_set();
                                  goto end;
                          }
      
                          /* Wait for forked processes to exit */
                          while (wait(NULL)) {
                                  if (errno == ECHILD)
                                          break;
                          };
                  }
          end:
                  return 0;
          }
      
      [akpm@linux-foundation.org: use normal comment layout]
      Signed-off-by: default avatarHerton R. Krzesinski <herton@redhat.com>
      Acked-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      CC: Aristeu Rozanski <aris@redhat.com>
      Cc: David Jeffery <djeffery@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      a99220f8
    • Wanpeng Li's avatar
      mm/hwpoison: fix page refcount of unknown non LRU page · 0cc1cdb1
      Wanpeng Li authored
      commit 4f32be67 upstream.
      
      After trying to drain pages from pagevec/pageset, we try to get reference
      count of the page again, however, the reference count of the page is not
      reduced if the page is still not on LRU list.
      
      Fix it by adding the put_page() to drop the page reference which is from
      __get_any_page().
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      0cc1cdb1