1. 24 Nov, 2017 13 commits
    • ocfs2: should wait dio before inode lock in ocfs2_setattr() · c4baa4a5
      alex chen authored
      commit 28f5a8a7 upstream.
      
      We should wait for dio requests to finish before taking the inode lock
      in ocfs2_setattr(); otherwise the following deadlock will happen:
      
      process 1                  process 2                    process 3
      truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
      ocfs2_setattr
       ocfs2_inode_lock_tracker
        ocfs2_inode_lock_full
       inode_dio_wait
        __inode_dio_wait
        -->waiting for all dio
        requests finish
                                                              dlm_proxy_ast_handler
                                                               dlm_do_local_bast
                                                                ocfs2_blocking_ast
                                                                 ocfs2_generic_handle_bast
                                                                  set OCFS2_LOCK_BLOCKED flag
                              dio_end_io
                               dio_bio_end_aio
                                dio_complete
                                 ocfs2_dio_end_io
                                  ocfs2_dio_end_io_write
                                   ocfs2_inode_lock
                                    __ocfs2_cluster_lock
                                     ocfs2_wait_for_mask
                                     -->waiting for OCFS2_LOCK_BLOCKED
                                     flag to be cleared, that is waiting
                                     for 'process 1' unlocking the inode lock
                                 inode_dio_end
                                 -->here dec the i_dio_count, but will never
                                 be called, so a deadlock happened.
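      
      A minimal sketch of the resulting ordering (illustrative only, not the
      verbatim upstream diff; helper names mirror the call chain above):
      
        static int ocfs2_setattr_sketch(struct dentry *dentry, struct iattr *attr)
        {
                struct inode *inode = d_inode(dentry);
      
                /* 1. Drain all in-flight dio first, while no cluster lock is
                 *    held, so dio completion can still make progress. */
                inode_dio_wait(inode);
      
                /* 2. Only now take the cluster-wide inode lock, e.g. via
                 *    ocfs2_inode_lock_tracker(), and do the truncate work. */
                return 0;
        }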
      
      Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Acked-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4baa4a5
    • nvme: Fix memory order on async queue deletion · 8c325770
      Keith Busch authored
      This patch is a fix specific to the 3.19 - 4.4 kernels. The 4.5 kernel
      inadvertently fixed this bug differently (db3cbfff), but is not
      a stable candidate due to it being a complicated rewrite of the entire
      feature.
      
      This patch fixes a potential timing bug with nvme's asynchronous queue
      deletion, which causes an allocated request to be accidentally released
      due to the ordering of the shared completion context among the sq/cq
      pair. The completion context saves the request that issued the queue
      deletion. If the submission side deletion happens to reset the active
      request, the completion side will release the wrong request tag back into
      the pool of available tags. This means the driver will create multiple
      commands with the same tag, corrupting the queue context.
      
      The error is observable in the kernel logs like:
      
        "nvme XX:YY:ZZ completed id XX twice on qid:0"
      
      In this particular case, this message occurs because the queue is
      corrupted.
      
      The following timing sequence demonstrates the error:
      
        CPU A                                 CPU B
        -----------------------               -----------------------------
        nvme_irq
         nvme_process_cq
          async_completion
           queue_kthread_work  ----------->   nvme_del_sq_work_handler
                                               nvme_delete_cq
                                                adapter_async_del_queue
                                                 nvme_submit_admin_async_cmd
                                                  cmdinfo->req = req;
      
           blk_mq_free_request(cmdinfo->req); <-- wrong request!!!
      
      This patch fixes the bug by releasing the request in the completion side
      prior to waking the submission thread, such that that thread can't muck
      with the shared completion context.
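      
      A hedged sketch of that ordering (field names follow the call chain in
      the diagram above, not necessarily the exact 3.19 - 4.4 driver source):
      
        static void async_deletion_complete_sketch(struct async_cmd_info *cmdinfo)
        {
                struct request *req = cmdinfo->req;
      
                /* Free the request (returning its tag) before waking the
                 * deletion worker, so the worker can no longer race with
                 * the shared completion context. */
                cmdinfo->req = NULL;
                blk_mq_free_request(req);
                queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
        }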
      
      Fixes: a4aea562 ("NVMe: Convert to blk-mq")
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c325770
    • arm64: fix dump_instr when PAN and UAO are in use · 4310b6bf
      Mark Rutland authored
      commit c5cea06b upstream.
      
      If the kernel is set to show unhandled signals, and a user task does not
      handle a SIGILL as a result of an instruction abort, we will attempt to
      log the offending instruction with dump_instr before killing the task.
      
      We use dump_instr to log the encoding of the offending userspace
      instruction. However, dump_instr is also used to dump instructions from
      kernel space, and internally always switches to KERNEL_DS before dumping
      the instruction with get_user. When both PAN and UAO are in use, reading
      a user instruction via get_user while in KERNEL_DS will result in a
      permission fault, which leads to an Oops.
      
      As we have regs corresponding to the context of the original instruction
      abort, we can inspect this and only flip to KERNEL_DS if the original
      abort was taken from the kernel, avoiding this issue. At the same time,
      remove the redundant (and incorrect) comments regarding the order
      dump_mem and dump_instr are called in.
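      
      A minimal sketch of that logic, assuming a __dump_instr() helper that
      does the actual get_user() loop (this mirrors the description above,
      not the literal diff):
      
        static void dump_instr_sketch(const char *lvl, struct pt_regs *regs)
        {
                if (!user_mode(regs)) {
                        /* Kernel-mode fault: KERNEL_DS is required and
                         * PAN/UAO do not get in the way. */
                        mm_segment_t fs = get_fs();
      
                        set_fs(KERNEL_DS);
                        __dump_instr(lvl, regs);
                        set_fs(fs);
                } else {
                        /* User-mode fault: keep USER_DS so reading the user
                         * instruction stays legal under PAN/UAO. */
                        __dump_instr(lvl, regs);
                }
        }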
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Fixes: 57f4959b ("arm64: kernel: Add support for User Access Override")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4310b6bf
    • serial: omap: Fix EFR write on RTS deassertion · 1df403ab
      Lukas Wunner authored
      commit 2a71de2f upstream.
      
      Commit 348f9bb3 ("serial: omap: Fix RTS handling") sought to enable
      auto RTS upon manual RTS assertion and disable it on deassertion.
      However, it seems the latter was done incorrectly: it clears all bits
      in the Extended Features Register *except* auto RTS.
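      
      The difference is a single missing negation. A sketch with the usual
      serial_reg.h bit name ('efr' stands in for the driver's cached Extended
      Features Register value; this is not the literal patch context):
      
        static u8 efr_on_rts_deassert(u8 efr)
        {
                /* buggy:    efr &= UART_EFR_RTS;  -- keeps only auto RTS   */
                /* intended: clear only the auto-RTS bit, preserve the rest */
                return efr & ~UART_EFR_RTS;
        }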
      
      Fixes: 348f9bb3 ("serial: omap: Fix RTS handling")
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1df403ab
    • ima: do not update security.ima if appraisal status is not INTEGRITY_PASS · a9100b6f
      Roberto Sassu authored
      commit 020aae3e upstream.
      
      Commit b65a9cfc ("Untangling ima mess, part 2: deal with counters")
      moved the call of ima_file_check() from may_open() to do_filp_open() at a
      point where the file descriptor is already opened.
      
      This breaks the assumption made by IMA that file descriptors being closed
      belong to files whose access was granted by ima_file_check(). The
      consequence is that security.ima and security.evm are updated with good
      values, regardless of the current appraisal status.
      
      For example, if a file does not have security.ima, IMA will create it after
      opening the file for writing, even if access is denied. Access to the file
      will be allowed afterwards.
      
      Avoid this issue by checking the appraisal status before updating
      security.ima.
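      
      A sketch of the added guard (simplified; the iint field and helper
      names are my reading of the IMA appraisal code, not a quote from the
      patch):
      
        /* Only write back security.ima when appraisal actually passed;
         * otherwise leave the xattr alone. */
        if (iint->ima_file_status != INTEGRITY_PASS)
                return;
        ima_fix_xattr(dentry, iint);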
      Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
      Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a9100b6f
    • net/sctp: Always set scope_id in sctp_inet6_skb_msgname · 51b8aea7
      Eric W. Biederman authored
      
      [ Upstream commit 7c8a61d9 ]
      
      Alexander Potapenko, while testing the kernel with KMSAN and syzkaller,
      discovered that in some configurations sctp would leak 4 bytes of
      kernel stack.
      
      Working with his reproducer I discovered that those 4 bytes that
      are leaked are the scope_id of an ipv6 address returned by recvmsg.
      
      With a little code inspection and a shrewd guess I discovered that
      sctp_inet6_skb_msgname only initializes the scope_id field for link
      local ipv6 addresses to the interface index the link local address
      pertains to instead of initializing the scope_id field for all ipv6
      addresses.
      
      That is almost reasonable, as scope_ids are meaningful only for link
      local addresses.  Set the scope_id in all other cases to 0, which is
      not a valid interface index, to make it clear there is nothing useful
      in the scope_id field.
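      
      The resulting initialization looks roughly like this (a sketch over the
      struct sockaddr_in6 fields; the interface-index lookup done by
      sctp_inet6_skb_msgname is abbreviated to an illustrative variable):
      
        /* Always give sin6_scope_id a defined value: the interface index
         * for link local source addresses, 0 (never a valid ifindex)
         * otherwise, so no stack bytes leak to userspace. */
        if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
                addr->v6.sin6_scope_id = receiving_ifindex;     /* illustrative */
        else
                addr->v6.sin6_scope_id = 0;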
      
      There should be no danger of breaking userspace as the stack leak
      guaranteed that previously meaningless random data was being returned.
      
      Fixes: 372f525b ("SCTP:  Resync with LKSCTP tree.")
      History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Reported-by: Alexander Potapenko <glider@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      51b8aea7
    • fealnx: Fix building error on MIPS · ae93cefb
      Huacai Chen authored
      
      [ Upstream commit cc54c1d3 ]
      
      This patch tries to fix the build error on MIPS. The reason is that
      MIPS already defines the LONG macro, which conflicts with the LONG
      enum in drivers/net/ethernet/fealnx.c.
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae93cefb
    • sctp: do not peel off an assoc from one netns to another one · 2a0e6090
      Xin Long authored
      
      [ Upstream commit df80cd9b ]
      
      Now, when peeling off an association to a sock in another netns, the
      transports in this assoc are not rehashed and keep using the old key
      in the hashtable.
      
      As a transport uses sk->net as the hash key to insert into the
      hashtable, removing these transports from the hashtable will be missed,
      because of the new netns, when the sock is closed and all transports
      are freed; a use-after-free can then be triggered later when looking up
      an asoc and dereferencing those transports.
      
      This is a very old issue, present since the very beginning. ChunYu
      found it with syzkaller fuzz testing using this series of calls:
      
        socket$inet6_sctp()
        bind$inet6()
        sendto$inet6()
        unshare(0x40000000)
        getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST()
        getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF()
      
      This patch blocks the call when peeling one assoc off from one netns
      to another, so that the netns of the transports would not go out of
      sync with the key in the hashtable (see the sketch below).
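      
      A sketch of the guard, as understood from the description above
      (net_eq()/sock_net() are the standard helpers for this kind of check):
      
        /* Refuse to peel an assoc off into a socket that lives in a netns
         * different from the current task's, so transport hash keys stay
         * consistent with the netns they were inserted under. */
        if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
                return -EINVAL;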
      
      Note that this patch didn't fix it by rehashing transports, as it's
      difficult to handle the situation when the tuple is already in use
      in the new netns. Besides, no one would like to peel off one assoc
      to another netns, considering ipaddrs, ifaces, etc. are usually
      different.
      Reported-by: ChunYu Wang <chunwang@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a0e6090
    • af_netlink: ensure that NLMSG_DONE never fails in dumps · 4cfc0b41
      Jason A. Donenfeld authored
      
      [ Upstream commit 0642840b ]
      
      The way people generally use netlink_dump is that they fill in the skb
      as much as possible, breaking when nla_put returns an error. Then, they
      get called again and start filling out the next skb, and again, and so
      forth. The mechanism at work here is the ability for the iterative
      dumping function to detect when the skb is filled up and not fill it
      past the brim, waiting for a fresh skb for the rest of the data.
      
      However, if the attributes are small and nicely packed, it is possible
      that a dump callback function successfully fills in attributes until the
      skb is of size 4080 (libmnl's default page-sized receive buffer size).
      The dump function completes, satisfied, and then, if it happens to be
      that this is actually the last skb, and no further ones are to be sent,
      then netlink_dump will add on the NLMSG_DONE part:
      
        nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
      
      It is very important that netlink_dump does this, of course. However, in
      this example, that call to nlmsg_put_answer will fail, because the
      previous filling by the dump function did not leave it enough room. And
      how could it possibly have done so? All of the nla_put variety of
      functions simply check to see if the skb has enough tailroom,
      independent of the context it is in.
      
      In order to keep the important assumptions of all netlink dump users, it
      is therefore important to give them an skb that has this end part of the
      tail already reserved, so that the call to nlmsg_put_answer does not
      fail. Otherwise, library authors are forced to find some bizarre sized
      receive buffer that has a large modulo relative to the common sizes of
      messages received, which is ugly and buggy.
      
      This patch thus saves the NLMSG_DONE for an additional message, for the
      case that things are dangerously close to the brim. This requires
      keeping track of the errno from ->dump() across calls.
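      
      A simplified sketch of that bookkeeping (the dump_done_errno field is
      per-socket state as described above; the surrounding netlink_dump()
      logic is abbreviated):
      
        /* Remember ->dump()'s return value across netlink_dump() calls. */
        if (nlk->dump_done_errno > 0)
                nlk->dump_done_errno = cb->dump(skb, cb);
      
        if (nlk->dump_done_errno > 0 ||
            skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
                /* More data pending, or no room left for NLMSG_DONE in this
                 * skb: deliver the skb as-is and emit DONE on the next call. */
                return 0;
        }
      
        /* Tailroom is guaranteed above, so this can no longer fail. */
        nlmsg_put_answer(skb, cb, NLMSG_DONE,
                         sizeof(nlk->dump_done_errno), NLM_F_MULTI);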
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4cfc0b41
    • vlan: fix a use-after-free in vlan_device_event() · ef206ea7
      Cong Wang authored
      
      [ Upstream commit 052d41c0 ]
      
      After refcnt reaches zero, vlan_vid_del() could free
      dev->vlan_info via RCU:
      
      	RCU_INIT_POINTER(dev->vlan_info, NULL);
      	call_rcu(&vlan_info->rcu, vlan_info_rcu_free);
      
      However, the pointer 'grp' still points to that memory
      since it is set before vlan_vid_del():
      
              vlan_info = rtnl_dereference(dev->vlan_info);
              if (!vlan_info)
                      goto out;
              grp = &vlan_info->grp;
      
      Depending on when that RCU callback is scheduled, we could trigger
      a use-after-free in vlan_group_for_each_dev() right after this
      vlan_vid_del().
      
      Fix it by moving vlan_vid_del() before setting grp. This
      is also symmetric to the vlan_vid_add() we call in
      vlan_device_event().
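      
      Sketched against the code quoted above (abridged, not the full
      NETDEV_DOWN handling):
      
        /* Delete the vid while dev->vlan_info is still guaranteed to be
         * alive, and only then dereference it to get grp. */
        if (event == NETDEV_DOWN)
                vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
      
        vlan_info = rtnl_dereference(dev->vlan_info);
        if (!vlan_info)
                goto out;
        grp = &vlan_info->grp;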
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Fixes: efc73f4b ("net: Fix memory leak - vlan_info struct")
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Girish Moodalbail <girish.moodalbail@oracle.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Girish Moodalbail <girish.moodalbail@oracle.com>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef206ea7
    • bonding: discard lowest hash bit for 802.3ad layer3+4 · 3bb6245e
      Hangbin Liu authored
      
      [ Upstream commit b5f86218 ]
      
      After commit 07f4c900 ("tcp/dccp: try to not exhaust ip_local_port_range
      in connect()"), we will try to use even ports for connect(). Then if an
      application (seen clearly with iperf) opens multiple streams to the same
      destination IP and port, each stream will be given an even source port.
      
      So the bonding driver's simple xmit_hash_policy based on layer3+4 addressing
      will always hash all these streams to the same interface. And the total
      throughput will be limited to a single slave.
      
      Changing the tcp code would impact overall tcp behavior just for the
      sake of bonding. Paolo Abeni suggested fixing this by changing the
      bonding code only, which is more reasonable and has less impact.
      
      Fix this by discarding the lowest hash bit because it contains little entropy.
      After the fix we can re-balance between slaves.
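      
      The change amounts to one shift at the end of the layer3+4 hash
      computation (a sketch of the bond_xmit_hash() tail, not the full
      function):
      
        /* Discard bit 0 of the flow hash: with connect() now preferring
         * even local ports it carries almost no entropy, and keeping it
         * pins all of an application's streams to the same slave. */
        return hash >> 1;       /* previously: return hash; */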
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bb6245e
    • netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed · 001e9cbe
      Ye Yin authored
      
      [ Upstream commit 2b5ec1a5 ]
      
      When running ipvs in two different network namespaces on the same host,
      with one ipvs forwarding network traffic to the ipvs in the other
      network namespace, the 'ipvs_property' flag makes the second ipvs take
      no effect. So we should clear 'ipvs_property' when the SKB's network
      namespace is changed.
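      
      A sketch of the change in skb_scrub_packet() (simplified; the field
      only exists when CONFIG_IP_VS is enabled, so a real change has to
      guard it accordingly):
      
        /* Only when the skb crosses network namespaces (xnet): make it
         * forget it was already handled by one ipvs instance, so the ipvs
         * in the target netns can process it normally. */
        if (xnet)
                skb->ipvs_property = 0;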
      
      Fixes: 621e84d6 ("dev: introduce skb_scrub_packet()")
      Signed-off-by: Ye Yin <hustcat@gmail.com>
      Signed-off-by: Wei Zhou <chouryzhou@gmail.com>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      001e9cbe
    • tcp: do not mangle skb->cb[] in tcp_make_synack() · 0c1282c7
      Eric Dumazet authored
      
      [ Upstream commit 3b117750 ]
      
      Christoph Paasch sent a patch to address the following issue:
      
      tcp_make_synack() is leaving some TCP private info in skb->cb[],
      then sends the packet by means other than tcp_transmit_skb().
      
      tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
      IPv4/IPv6 stacks, but we have no such cleanup for SYNACK.
      
      tcp_make_synack() should not use tcp_init_nondata_skb():
      
      tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
      queues (the ones that are only sent via tcp_transmit_skb())
      
      This patch fixes the issue and should even save a few cpu cycles ;)
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c1282c7
  2. 21 Nov, 2017 27 commits