1. 17 Nov, 2017 17 commits
    • fs, nfs: convert pnfs_layout_segment.pls_refcount from atomic_t to refcount_t · eba6dd69
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This helps avoid accidental
      refcount overflows that might lead to use-after-free
      situations.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • fs, nfs: convert nfs4_pnfs_ds.ds_count from atomic_t to refcount_t · a2a5dea7
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable nfs4_pnfs_ds.ds_count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      Suggested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
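
The properties listed in this commit message (start at 1, free at zero, no increments after zero, no overflow) can be illustrated with a small userspace sketch. This is not the kernel's refcount_t implementation: `struct ref`, `ref_set`, `ref_inc_not_zero`, `ref_dec_and_test`, and `REF_SATURATED` are hypothetical names, and C11 atomics stand in for the kernel primitives. The assumption is that a saturated or underflowed counter should leak the object rather than risk a use-after-free.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for refcount_t; not kernel code. */
struct ref {
    atomic_uint count;
};

/* Once the counter reaches this value it stays pinned there. */
#define REF_SATURATED UINT_MAX

static void ref_set(struct ref *r, unsigned int n)
{
    atomic_store(&r->count, n);
}

/* Take a reference unless the object is already dead (count == 0). */
static bool ref_inc_not_zero(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0)
            return false;           /* no increments after zero */
        if (old == REF_SATURATED)
            return true;            /* saturated: never wrap around */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Drop a reference; returns true when the caller must free the object. */
static bool ref_dec_and_test(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == REF_SATURATED)
            return false;           /* underflow or saturated: leak, don't free */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```

With a plain atomic_t, incrementing past the maximum wraps to zero and a later decrement frees an object that still has users; in this sketch the counter sticks at `REF_SATURATED` instead.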
    • NFSv4.1: Fix up replays of interrupted requests · 3be0f80b
      Trond Myklebust authored
      If the previous request on a slot was interrupted before it was
      processed by the server, then our slot sequence number may be out of whack,
      and so we try the next operation using the old sequence number.
      
      The problem with this is that not all servers check to see that the
      client is replaying the same operations as previously when they decide
      to go to the replay cache, and so instead of the expected error of
      NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could
      (if the operations match up) be mistaken by the client for a new reply.
      
      To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op
      in order to resync our slot sequence number.
      
      Cc: Olga Kornievskaia <olga.kornievskaia@gmail.com>
      [olga.kornievskaia@gmail.com: fix an Oops]
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
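
The resync idea in this commit can be modelled with a toy slot table. The server-side rule (a sequence ID one higher than the cached one is a new request, an equal one is a replay, anything else is misordered) follows the NFSv4.1 session model; everything else here is an assumption made for illustration. `server_sequence` and `client_resync` are hypothetical names, and the real client code also handles errors, timeouts, and reply caching that this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

enum slot_verdict { SLOT_NEW, SLOT_REPLAY, SLOT_MISORDERED };

/* Toy server-side slot: remembers the last sequence ID it executed. */
struct server_slot {
    uint32_t seq;
};

/* Classify an incoming SEQUENCE op the way a session slot would. */
static enum slot_verdict server_sequence(struct server_slot *s, uint32_t req_seq)
{
    if (req_seq == s->seq + 1) {
        s->seq = req_seq;           /* executed as a new request */
        return SLOT_NEW;
    }
    if (req_seq == s->seq)
        return SLOT_REPLAY;         /* answered from the replay cache */
    return SLOT_MISORDERED;
}

/*
 * Probe with a COMPOUND that contains only SEQUENCE, reusing the sequence
 * ID of the interrupted request. Returns 0 and stores the next usable
 * sequence ID, or -1 if the slot cannot be resynchronised this way.
 */
static int client_resync(struct server_slot *s, uint32_t interrupted_seq,
                         uint32_t *next_seq)
{
    enum slot_verdict v = server_sequence(s, interrupted_seq);

    if (v == SLOT_MISORDERED)
        return -1;
    /*
     * SLOT_NEW: the server never saw the interrupted request, and the
     * probe has now consumed its sequence ID.
     * SLOT_REPLAY: the server had already executed it.
     * Either way the slot is known to sit at interrupted_seq.
     */
    *next_seq = interrupted_seq + 1;
    return 0;
}
```

Because the probe carries no operations other than SEQUENCE, a cached reply to the interrupted request cannot be mistaken for a fresh reply to a real operation.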
    • xprtrdma: Remove atomic send completion counting · 6f0afc28
      Chuck Lever authored
      The sendctx circular queue now guarantees that xprtrdma cannot
      overflow the Send Queue, so remove the remaining bits of the
      original Send WQE counting mechanism.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: RPC completion should wait for Send completion · 01bb35c8
      Chuck Lever authored
      When an RPC Call includes a file data payload, that payload can come
      from pages in the page cache, or a user buffer (for direct I/O).
      
      If the payload can fit inline, xprtrdma includes it in the Send
      using a scatter-gather technique. xprtrdma mustn't allow the RPC
      consumer to re-use the memory where that payload resides before the
      Send completes. Otherwise, the new contents of that memory would be
      exposed by an HCA retransmit of the Send operation.
      
      So, block RPC completion on Send completion, but only in the case
      where a separate file data payload is part of the Send. This
      prevents the reuse of that memory while it is still part of a Send
      operation without an undue cost to other cases.
      
      Waiting is avoided in the common case because typically the Send
      will have completed long before the RPC Reply arrives.
      
      These days, an RPC timeout will trigger a disconnect, which tears
      down the QP. The disconnect flushes all waiting Sends. This bounds
      the amount of time the reply handler has to wait for a Send
      completion.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor rpcrdma_deferred_completion · 0ba6f370
      Chuck Lever authored
      Invoke a common routine for releasing hardware resources (for
      example, invalidating MRs). This needs to be done whether an
      RPC Reply has arrived or the RPC was terminated early.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add a field of bit flags to struct rpcrdma_req · 531cca0c
      Chuck Lever authored
      We have one boolean flag in rpcrdma_req today. I'd like to add more
      flags, so convert that boolean to a bit flag.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
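
Converting a lone boolean into a word of bit flags might look like the following userspace sketch. The struct, the flag names, and the helpers are hypothetical; the kernel would normally use its own atomic bit operations, whereas these plain bitwise helpers are not atomic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag numbers; new flags get the next free bit. */
enum {
    REQ_F_BACKCHANNEL = 0,
    REQ_F_TX_RESOURCES = 1,
};

/* Hypothetical request object: one flags word replaces separate bools. */
struct req {
    unsigned long flags;
};

static void req_set_flag(struct req *r, unsigned int bit)
{
    r->flags |= 1UL << bit;
}

static void req_clear_flag(struct req *r, unsigned int bit)
{
    r->flags &= ~(1UL << bit);
}

static bool req_test_flag(const struct req *r, unsigned int bit)
{
    return (r->flags >> bit) & 1UL;
}
```

Adding a further flag then costs one enum entry rather than another struct member.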
    • xprtrdma: Add data structure to manage RDMA Send arguments · ae72950a
      Chuck Lever authored
      Problem statement:
      
      Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
      enabled storage initiators don't handle delayed Send completion
      correctly. If Send completion is delayed beyond the end of a ULP
      transaction, the ULP may release resources that are still being used
      by the HCA to complete a long-running Send operation.
      
      This is a common design trait amongst our initiators. Most Send
      operations are faster than the ULP transaction they are part of.
      Waiting for a completion for these is typically unnecessary.
      
      Infrequently, a network partition or some other problem crops up
      where an ordering problem can occur. In NFS parlance, the RPC Reply
      arrives and completes the RPC, but the HCA is still retrying the
      Send WR that conveyed the RPC Call. In this case, the HCA can try
      to use memory that has been invalidated or DMA unmapped, and the
      connection is lost. If that memory has been re-used for something
      else (possibly not related to NFS), the Send retransmission
      exposes that data on the wire.
      
      Thus we cannot assume that it is safe to release Send-related
      resources just because a ULP reply has arrived.
      
      After some analysis, we have determined that the completion
      housekeeping will not be difficult for xprtrdma:
      
       - Inline Send buffers are registered via the local DMA key, and
         are already left DMA mapped for the lifetime of a transport
         connection, thus no additional handling is necessary for those
       - Gathered Sends involving page cache pages _will_ need to
         DMA unmap those pages after the Send completes. But like
         inline send buffers, they are registered via the local DMA key,
         and thus will not need to be invalidated
      
      In addition, RPC completion will need to wait for Send completion
      in the latter case. However, nearly always, the Send that conveys
      the RPC Call will have completed long before the RPC Reply
      arrives, and thus no additional latency will be accrued.
      
      Design notes:
      
      In this patch, the rpcrdma_sendctx object is introduced, and a
      lock-free circular queue is added to manage a set of them per
      transport.
      
      The RPC client's send path already prevents sending more than one
      RPC Call at the same time. This allows us to treat the consumer
      side of the queue (rpcrdma_sendctx_get_locked) as if there is a
      single consumer thread.
      
      The producer side of the queue (rpcrdma_sendctx_put_locked) is
      invoked only from the Send completion handler, which is a single
      thread of execution (soft IRQ).
      
      The only care that needs to be taken is with the tail index, which
      is shared between the producer and consumer. Only the producer
      updates the tail index. The consumer compares the head with the
      tail to ensure that a sendctx that is in use is never handed
      out again (or, expressed more conventionally, the queue is empty).
      
      When the sendctx queue empties completely, there are enough Sends
      outstanding that posting more Send operations can result in a Send
      Queue overflow. In this case, the ULP is told to wait and try again.
      This introduces strong Send Queue accounting to xprtrdma.
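
The queue described in these design notes can be sketched in userspace as a single-consumer, single-producer ring. This is an illustration under assumptions, not the xprtrdma code: the names `sendctx_get` and `sendctx_put`, the fixed `SENDCTX_COUNT`, and the use of C11 atomics are all stand-ins. As in the commit message, the "consumer" is the send path taking a context and the "producer" is the completion handler returning one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SENDCTX_COUNT 4u   /* hypothetical queue size */

struct sendctx {
    unsigned int index;    /* position of this context in the ring */
};

struct sendctx_queue {
    struct sendctx ctx[SENDCTX_COUNT];
    unsigned int head;     /* touched only by the send path */
    atomic_uint tail;      /* written only by the completion handler */
};

static unsigned int sendctx_next(unsigned int i)
{
    return (i + 1) % SENDCTX_COUNT;
}

static void sendctx_queue_init(struct sendctx_queue *q)
{
    for (unsigned int i = 0; i < SENDCTX_COUNT; i++)
        q->ctx[i].index = i;
    q->head = 0;
    atomic_init(&q->tail, 0);
}

/* Send path: take the next context, or NULL if all are in flight. */
static struct sendctx *sendctx_get(struct sendctx_queue *q)
{
    unsigned int next = sendctx_next(q->head);

    if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
        return NULL;       /* caller must wait and try again */
    q->head = next;
    return &q->ctx[next];
}

/*
 * Completion handler: sc and every context handed out before it are
 * free again. Assumes completions arrive in posting order.
 */
static void sendctx_put(struct sendctx_queue *q, struct sendctx *sc)
{
    assert(sc >= q->ctx && sc < q->ctx + SENDCTX_COUNT);
    atomic_store_explicit(&q->tail, sc->index, memory_order_release);
}
```

One slot is always left unused so that a full ring and an idle ring can be told apart; with `SENDCTX_COUNT` of 4, at most three Sends are outstanding at once.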
      
      As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      suggested a mechanism that does not require signaling every Send.
      We signal once every N Sends, and perform SGE unmapping of N Send
      operations during that one completion.
      Reported-by: Sagi Grimberg <sagi@grimberg.me>
      Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
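
The "signal once every N Sends" idea from the last paragraph of this commit message can be sketched as a countdown. The struct and function names are hypothetical, and the batch size is a parameter here; how the real transport chooses N is not stated in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-transport state for batching Send completions. */
struct send_signal_state {
    unsigned int batch;      /* request a completion every 'batch' Sends */
    unsigned int remaining;  /* Sends left before the next signaled one */
};

static void send_signal_init(struct send_signal_state *s, unsigned int batch)
{
    assert(batch > 0);
    s->batch = batch;
    s->remaining = batch;
}

/* Returns true when the Send being posted should request a completion. */
static bool send_needs_signal(struct send_signal_state *s)
{
    if (--s->remaining > 0)
        return false;
    s->remaining = s->batch;
    return true;
}
```

The completion handler for the one signaled Send then cleans up the unsignaled Sends posted before it, which is what lets a single completion cover N operations.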
    • xprtrdma: "Unoptimize" rpcrdma_prepare_hdr_sge() · a062a2a3
      Chuck Lever authored
      Commit 655fec69 ("xprtrdma: Use gathered Send for large inline
      messages") assumed that, since the zeroth element of the Send SGE
      array always pointed to req->rl_rdmabuf, it needed to be initialized
      just once. This was a valid assumption because the Send SGE array
      and rl_rdmabuf both live in the same rpcrdma_req.
      
      In a subsequent patch, the Send SGE array will be separated from the
      rpcrdma_req, so the zeroth element of the SGE array needs to be
      initialized every time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Change return value of rpcrdma_prepare_send_sges() · 857f9aca
      Chuck Lever authored
      Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
      instead of a bool. Soon callers will want distinct treatments of
      different types of failures.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix error handling in rpcrdma_prepare_msg_sges() · 394b2c77
      Chuck Lever authored
      When this function fails, it needs to undo the DMA mappings it's
      done so far. Otherwise these are leaked.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
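
The unwind-on-failure pattern this fix describes can be sketched as follows. `fake_map` and `fake_unmap` are hypothetical stand-ins for DMA mapping calls, and the `fail_at` parameter exists only to simulate a mapping failure; the point is the `goto` unwind that releases exactly the mappings made so far.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sge {
    bool mapped;
};

/* Hypothetical stand-in for a DMA mapping call that may fail. */
static int fake_map(struct sge *s, bool fail)
{
    if (fail)
        return -EIO;
    s->mapped = true;
    return 0;
}

/* Hypothetical stand-in for the matching unmap call. */
static void fake_unmap(struct sge *s)
{
    assert(s->mapped);
    s->mapped = false;
}

/* Map n SGEs; on failure, undo the mappings already made. */
static int prepare_sges(struct sge *sges, size_t n, size_t fail_at)
{
    size_t i;
    int rc;

    for (i = 0; i < n; i++) {
        rc = fake_map(&sges[i], i == fail_at);
        if (rc)
            goto out_unmap;
    }
    return 0;

out_unmap:
    /* i is the index that failed; entries below it were mapped. */
    while (i-- > 0)
        fake_unmap(&sges[i]);
    return rc;
}
```

Returning the negative errno from the failing call, rather than a bare false, also lets a caller tell different failures apart.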
    • xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges() · ad99f053
      Chuck Lever authored
      Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then
      rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs
      it added, plus one for the header SGE just mapped in
      rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an
      assumption about when these functions are called.
      
      Instead, maintain a running count that both functions can update
      with just the number of SGEs they have added to the SGE array.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
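
A running count that each helper bumps by only what it added might look like this. The struct, the helper names, and the `MAX_SEND_SGES` limit are hypothetical; the sketch only shows that neither helper needs to know how many SGEs the other contributed.

```c
#include <assert.h>
#include <errno.h>

#define MAX_SEND_SGES 4u   /* hypothetical limit */

struct send_wr {
    unsigned int num_sge;             /* running count of SGEs in use */
    int sge_kind[MAX_SEND_SGES];      /* 0 = header, 1 = message payload */
};

/* Adds exactly one SGE for the transport header. */
static int add_hdr_sge(struct send_wr *wr)
{
    if (wr->num_sge >= MAX_SEND_SGES)
        return -EIO;
    wr->sge_kind[wr->num_sge++] = 0;
    return 0;
}

/* Adds 'count' payload SGEs after whatever is already present. */
static int add_msg_sges(struct send_wr *wr, unsigned int count)
{
    if (count > MAX_SEND_SGES - wr->num_sge)
        return -EIO;
    while (count--)
        wr->sge_kind[wr->num_sge++] = 1;
    return 0;
}
```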
    • xprtrdma: Decode credits field in rpcrdma_reply_handler · be798f90
      Chuck Lever authored
      We need to decode and save the incoming rdma_credits field _after_
      we know that the direction of the message is "forward direction
      Reply". Otherwise, the credits value in reverse direction Calls is
      also used to update the forward direction credits.
      
      It is safe to decode the rdma_credits field in rpcrdma_reply_handler
      now that rpcrdma_reply_handler is single-threaded. Receives complete
      in the same order as they were sent on the NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion · d8f532d2
      Chuck Lever authored
      I noticed that the soft IRQ thread looked pretty busy under heavy
      I/O workloads. perf suggested one area that was expensive was the
      queue_work() call in rpcrdma_wc_receive. That gave me some ideas.
      
      Instead of scheduling a separate worker to process RPC Replies,
      promote the Receive completion handler to IB_POLL_WORKQUEUE, and
      invoke rpcrdma_reply_handler directly.
      
      Note that the poll workqueue is single-threaded. In order to keep
      memory invalidation from serializing all RPC Replies, handle any
      necessary invalidation tasks in a separate multi-threaded workqueue.
      
      This provides a two-tier scheme, similar to OS I/O interrupt
      handlers: A fast interrupt handler that schedules the slow handler
      and re-enables the interrupt, and a slower handler that is invoked
      for any needed heavy lifting.
      
      Benefits include:
      - One less context switch for RPCs that don't register memory
      - Receive completion handling is moved out of soft IRQ context to
        make room for other users of soft IRQ
      - The same CPU core now DMA syncs and XDR decodes the Receive buffer
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor rpcrdma_reply_handler some more · e1352c96
      Chuck Lever authored
      Clean up: I'd like to be able to invoke the tail of
      rpcrdma_reply_handler in two different places. Split the tail out
      into its own helper function.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move decoded header fields into rpcrdma_rep · 5381e0ec
      Chuck Lever authored
      Clean up: Make it easier to pass the decoded XID, vers, credits, and
      proc fields around by moving these variables into struct rpcrdma_rep.
      
      Note: the credits field will be handled in a subsequent patch.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Throw away reply when version is unrecognized · 61433af5
      Chuck Lever authored
      A reply with an unrecognized value in the version field means the
      transport header is potentially garbled and therefore all the fields
      are untrustworthy.
      
      Fixes: 59aa1f9a ("xprtrdma: Properly handle RDMA_ERROR ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
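
Checking the version before trusting any other header field can be sketched like this. The four 32-bit big-endian fields (XID, version, credits, procedure) follow the RPC-over-RDMA version 1 header layout; `struct rdma_hdr`, `be32_at`, `decode_hdr`, and `HDR_VERSION_ONE` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_VERSION_ONE 1u   /* hypothetical name for protocol version 1 */

/* Decoded fixed header fields, kept together in one struct. */
struct rdma_hdr {
    uint32_t xid;
    uint32_t vers;
    uint32_t credits;
    uint32_t proc;
};

/* Read one big-endian 32-bit word. */
static uint32_t be32_at(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns false, and the reply should be dropped, if the header is unusable. */
static bool decode_hdr(const unsigned char *buf, size_t len, struct rdma_hdr *out)
{
    if (len < 16)
        return false;
    out->xid = be32_at(buf);
    out->vers = be32_at(buf + 4);
    if (out->vers != HDR_VERSION_ONE)
        return false;          /* garbled header: trust nothing else in it */
    out->credits = be32_at(buf + 8);
    out->proc = be32_at(buf + 12);
    return true;
}
```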
  2. 16 Oct, 2017 8 commits
    • xprtrdma: Remove ro_unmap_safe · 2b4f8923
      Chuck Lever authored
      Clean up: There are no remaining callers of this method.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use ro_unmap_sync in xprt_rdma_send_request · 4ce6c04c
      Chuck Lever authored
      The "safe" version of ro_unmap is used here to avoid waiting
      unnecessarily. However:
      
       - It is safe to wait. After all, we have to wait anyway when using
         FMR to register memory.
      
       - This case is rare: it occurs only after a reconnect.
      
      By switching this call site to ro_unmap_sync, the final use of
      ro_unmap_safe is removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't defer fencing an async RPC's chunks · 8f66b1a5
      Chuck Lever authored
      In current kernels, waiting in xprt_release appears to be safe to
      do. I had erroneously believed that for ASYNC RPCs, waiting of any
      kind in xprt_release->xprt_rdma_free would result in deadlock. I've
      done injection testing and consulted with Trond to confirm that
      waiting in the RPC release path is safe.
      
      For the very few times where RPC resources haven't yet been released
      earlier by the reply handler, it is safe to wait synchronously in
      xprt_rdma_free for invalidation rather than deferring it to MR
      recovery.
      
      Note: When the QP is in error state, posting a LocalInvalidate should
      flush and mark the MR as bad. There is no way the remote HCA can
      access that MR via a QP in error state, so it is effectively already
      inaccessible and thus safe for the Upper Layer to access. The next
      time the MR is used it should be recognized and cleaned up properly
      by frwr_op_map.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: remove special-case revalidate in nfs_opendir() · 1fea73ac
      NeilBrown authored
      Commit f5a73672 ("NFS: allow close-to-open cache semantics to
      apply to root of NFS filesystem") added a call to
      __nfs_revalidate_inode() to nfs_opendir() because the lookup
      process wouldn't reliably do this.
      
      Subsequent commit a3fbbde7 ("VFS: we need to set LOOKUP_JUMPED
      on mountpoint crossing") makes this unnecessary.  So remove the
      unnecessary code.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: revalidate "." etc correctly on "open". · b688741c
      NeilBrown authored
      For correct close-to-open semantics, NFS must validate
      the change attribute of a directory (or file) on open.
      
      Since commit ecf3d1f1 ("vfs: kill FS_REVAL_DOT by adding a
      d_weak_revalidate dentry op"), open() of "." or a path ending ".." is
      not revalidated reliably (except when that directory is a mount point).
      
      Prior to that commit, "." was revalidated using nfs_lookup_revalidate()
      which checks the LOOKUP_OPEN flag and forces revalidation if the flag is
      set.
      Since that commit, nfs_weak_revalidate() is used for NFSv3 (which
      ignores the flags) and nothing is used for NFSv4.
      
      This is fixed by using nfs_lookup_verify_inode() in
      nfs_weak_revalidate().  This does the revalidation exactly when needed.
      Also, add a definition of .d_weak_revalidate for NFSv4.
      
      The incorrect behavior is easily demonstrated by running "echo *" in
      some non-mountpoint NFS directory while watching network traffic.
      Without this patch, "echo *" sometimes doesn't produce any traffic.
      With the patch it always does.
      
      Fixes: ecf3d1f1 ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op")
      cc: stable@vger.kernel.org (3.9+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: Don't compare apples to elephants to determine access bits · 1750d929
      Anna Schumaker authored
      The NFS_ACCESS_* flags aren't a 1:1 mapping to the MAY_* flags, so
      checking for MAY_WHATEVER might have surprising results in
      nfs*_proc_access().  Let's simplify this check when determining which
      bits to ask for, and do it in a generic place instead of copying code
      for each NFS version.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
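
Why the two flag families cannot be compared directly, and what a generic "which bits to ask for" helper might look like, is sketched below. The numeric values mirror the NFSv3/NFSv4 ACCESS bits and the VFS MAY_* bits; `access_bits_to_request` is a hypothetical helper, and choosing the bits purely by object type is one reading of this commit message, not a copy of the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* NFS ACCESS bits (same values in NFSv3 and NFSv4). */
#define NFS_ACCESS_READ    0x0001u
#define NFS_ACCESS_LOOKUP  0x0002u
#define NFS_ACCESS_MODIFY  0x0004u
#define NFS_ACCESS_EXTEND  0x0008u
#define NFS_ACCESS_DELETE  0x0010u
#define NFS_ACCESS_EXECUTE 0x0020u

/* VFS permission mask bits: note that they do not line up with the above. */
#define MAY_EXEC  0x1u
#define MAY_WRITE 0x2u
#define MAY_READ  0x4u

/*
 * Hypothetical helper: pick the ACCESS bits to request from the server
 * based only on the object type, instead of translating MAY_* flags.
 */
static unsigned int access_bits_to_request(bool is_dir)
{
    unsigned int bits = NFS_ACCESS_READ | NFS_ACCESS_MODIFY | NFS_ACCESS_EXTEND;

    if (is_dir)
        bits |= NFS_ACCESS_DELETE | NFS_ACCESS_LOOKUP;  /* directory-only bits */
    else
        bits |= NFS_ACCESS_EXECUTE;                     /* file-only bit */
    return bits;
}
```

For instance, `MAY_READ` is 0x4 while `NFS_ACCESS_READ` is 0x1, and the single `MAY_WRITE` bit corresponds to several ACCESS bits, so testing one family's constant against the other's mask gives misleading answers.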
    • NFS: Create NFS_ACCESS_* flags · 3c181827
      Anna Schumaker authored
      Passing the NFS v4 flags into the v3 code seems weird to me, even if
      they are defined to the same values.  This patch adds in generic flags
      to help me feel better.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Linux 4.14-rc5 · 33d930e5
      Linus Torvalds authored
  3. 15 Oct, 2017 3 commits
    • Merge tag 'char-misc-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · ae7df8f9
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are 4 patches to resolve some char/misc driver issues found these
        past weeks.
      
        One of them is a mei bugfix and another is a new mei device id. There
        is also a hyper-v fix for a reported issue, and a binder issue fix for
        a problem reported by a few people.
      
        All of these have been in my tree for a while, I don't know if
        linux-next is really testing much this month. But 0-day is happy with
        them :)"
      
      * tag 'char-misc-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        binder: fix use-after-free in binder_transaction()
        Drivers: hv: vmbus: Fix bugs in rescind handling
        mei: me: add gemini lake devices id
        mei: always use domain runtime pm callbacks.
    • Merge tag 'usb-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 7a263b16
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a handful of USB driver fixes for 4.14-rc5.
      
        There is the "usual" usb-serial fixes and device ids, USB gadget
        fixes, and some more fixes found by the fuzz testing that is happening
        on the USB layer right now.
      
        All of these have been in my tree this week with no reported issues"
      
      * tag 'usb-4.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: usbtest: fix NULL pointer dereference
        usb: gadget: configfs: Fix memory leak of interface directory data
        usb: gadget: composite: Fix use-after-free in usb_composite_overwrite_options
        usb: misc: usbtest: Fix overflow in usbtest_do_ioctl()
        usb: renesas_usbhs: Fix DMAC sequence for receiving zero-length packet
        USB: dummy-hcd: Fix deadlock caused by disconnect detection
        usb: phy: tegra: Fix phy suspend for UDC
        USB: serial: console: fix use-after-free after failed setup
        USB: serial: console: fix use-after-free on disconnect
        USB: serial: qcserial: add Dell DW5818, DW5819
        USB: serial: cp210x: add support for ELV TFD500
        USB: serial: cp210x: fix partnum regression
        USB: serial: option: add support for TP-Link LTE module
        USB: serial: ftdi_sio: add id for Cypress WICED dev board
    • Merge tag 'dmaengine-fix-4.14-rc5' of git://git.infradead.org/users/vkoul/slave-dma · 7a23c5ab
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
       "Here are fixes for this round
      
         - fix spinlock usage and FIFO response for altera driver
      
         - fix ti crossbar race condition
      
         - fix edma memcpy align"
      
      * tag 'dmaengine-fix-4.14-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: altera: fix spinlock usage
        dmaengine: altera: fix response FIFO emptying
        dmaengine: ti-dma-crossbar: Fix possible race condition with dma_inuse
        dmaengine: edma: Align the memcpy acnt array size with the transfer
  4. 14 Oct, 2017 12 commits
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e7a36a6e
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "A laundry list of fixes:
      
         - fix reboot breakage on some PCID-enabled systems
      
         - fix crashes/hangs on some PCID-enabled systems
      
         - fix microcode loading on certain older CPUs
      
         - various unwinder fixes
      
         - extend an APIC quirk to more hardware systems and disable APIC
           related warning on virtualized systems
      
         - various Hyper-V fixes
      
         - a macro definition robustness fix
      
         - remove jprobes IRQ disabling
      
         - various mem-encryption fixes"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/microcode: Do the family check first
        x86/mm: Flush more aggressively in lazy TLB mode
        x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping
        x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on hypervisors
        x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c
        x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing
        x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures
        x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs
        x86/unwind: Disable unwinder warnings on 32-bit
        x86/unwind: Align stack pointer in unwinder dump
        x86/unwind: Use MSB for frame pointer encoding on 32-bit
        x86/unwind: Fix dereference of untrusted pointer
        x86/alternatives: Fix alt_max_short macro to really be a max()
        x86/mm/64: Fix reboot interaction with CR4.PCIDE
        kprobes/x86: Remove IRQ disabling from jprobe handlers
        kprobes/x86: Set up frame pointer in kprobe trampoline
    • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a339b351
      Linus Torvalds authored
      Pull scheduler fixes from Ingo Molnar:
       "Three fixes that address an SMP balancing performance regression"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/core: Ensure load_balance() respects the active_mask
        sched/core: Address more wake_affine() regressions
        sched/core: Fix wake_affine() performance regression
    • Merge branch 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7b764ced
      Linus Torvalds authored
      Pull RAS fixes from Ingo Molnar:
       "A boot parameter fix, plus a header export fix"
      
      * 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mce: Hide mca_cfg
        RAS/CEC: Use the right length for "cec_disable"
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 26c923ab
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Some tooling fixes plus three kernel fixes: a memory leak fix, a
        statistics fix and a crash fix"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/uncore: Fix memory leaks on allocation failures
        perf/core: Fix cgroup time when scheduling descendants
        perf/core: Avoid freeing static PMU contexts when PMU is unregistered
        tools include uapi bpf.h: Sync kernel ABI header with tooling header
        perf pmu: Unbreak perf record for arm/arm64 with events with explicit PMU
        perf script: Add missing separator for "-F ip,brstack" (and brstackoff)
        perf callchain: Compare dsos (as well) for CCKEY_FUNCTION
    • Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 60a6ca6c
      Linus Torvalds authored
      Pull locking fixes from Ingo Molnar:
       "Two lockdep fixes for bugs introduced by the cross-release dependency
        tracking feature - plus a commit that disables it because performance
        regressed in an abysmal fashion on some systems"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/lockdep: Disable cross-release features for now
        locking/selftest: Avoid false BUG report
        locking/lockdep: Fix stacktrace mess
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2b34218e
      Linus Torvalds authored
      Pull irq fixes from Ingo Molnar:
       "A CPU hotplug related fix, plus two related sanity checks"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq/cpuhotplug: Enforce affinity setting on startup of managed irqs
        genirq/cpuhotplug: Add sanity check for effective affinity mask
        genirq: Warn when effective affinity is not updated
    • Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a515d05e
      Linus Torvalds authored
      Pull objtool fix from Ingo Molnar:
       "A single objtool fix: avoid silently broken ORC debuginfo builds and
        error out instead"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER
    • x86/microcode: Do the family check first · 1f161f67
      Borislav Petkov authored
      On CPUs like AMD's Geode, for example, we shouldn't even try to load
      microcode because they do not support the modern microcode loading
      interface.
      
      However, we do the family check *after* the other checks whether the
      loader has been disabled on the command line or whether we're running in
      a guest.
      
      So move the family checks first in order to exit early if we're being
      loaded on an unsupported family.
      Reported-and-tested-by: Sven Glodowski <glodi1@arcor.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org> # 4.11..
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396
      Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1f161f67
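The reordering described in the x86/microcode commit above can be illustrated with a small userspace model. This is a sketch only, not the kernel's loader code: `struct boot_env`, `enum load_decision`, `microcode_load_decision()` and the `min_family` threshold are hypothetical names introduced here for illustration.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not kernel identifiers. */
enum load_decision {
	LOAD_OK,
	SKIP_UNSUPPORTED_FAMILY,
	SKIP_DISABLED_ON_CMDLINE,
	SKIP_RUNNING_IN_GUEST,
};

/* Hypothetical snapshot of what the early loader knows at boot. */
struct boot_env {
	unsigned int family;       /* CPU family of the boot CPU */
	unsigned int min_family;   /* assumed lowest family with the modern interface */
	bool disabled_on_cmdline;  /* loader disabled on the command line */
	bool running_in_guest;     /* running under a hypervisor */
};

/*
 * Decide whether to attempt a microcode load. The family check comes
 * first, so an unsupported CPU exits before any other check runs.
 */
enum load_decision microcode_load_decision(const struct boot_env *env)
{
	if (env->family < env->min_family)
		return SKIP_UNSUPPORTED_FAMILY;

	if (env->disabled_on_cmdline)
		return SKIP_DISABLED_ON_CMDLINE;

	if (env->running_in_guest)
		return SKIP_RUNNING_IN_GUEST;

	return LOAD_OK;
}
```

With this ordering, an unsupported family is reported as the reason for skipping even when the loader is also disabled or the system is a guest, which is the early exit the commit message asks for.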
    • Ingo Molnar's avatar
      locking/lockdep: Disable cross-release features for now · b483cf3b
      Ingo Molnar authored
      Johan Hovold reported a big boot-time slowdown on his system, caused by lockdep:
      
      > I had noticed that the BeagleBone Black boot time appeared to have
      > increased significantly with 4.14 and yesterday I finally had time to
      > investigate it.
      >
      > Boot time (from "Linux version" to login prompt) had in fact doubled
      > since 4.13 where it took 17 seconds (with my current config) compared to
      > the 35 seconds I now see with 4.14-rc4.
      >
      > A quick bisect pointed to lockdep and specifically the following commit:
      >
      >	28a903f6 ("locking/lockdep: Handle non(or multi)-acquisition of a crosslock")
      
      Because the final v4.14 release is close, disable the cross-release lockdep
      features for now.
      Bisected-by: Johan Hovold <johan@kernel.org>
      Debugged-by: Johan Hovold <johan@kernel.org>
      Reported-by: Johan Hovold <johan@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: kernel-team@lge.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mm@kvack.org
      Cc: linux-omap@vger.kernel.org
      Link: http://lkml.kernel.org/r/20171014072659.f2yr6mhm5ha3eou7@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b483cf3b
    • Linus Torvalds's avatar
      Merge branch '4.14-fixes' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · be1f16ba
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
       "More MIPS fixes for 4.14:
      
         - Loongson 1: Set the default number of RX and TX queues to
           accommodate recent changes of stmmac driver.
      
         - BPF: Fix uninitialised target compiler error.
      
         - Fix cmpxchg on 32 bit signed ints for 64 bit kernels with
           !kernel_uses_llsc
      
         - Fix generic-board-config.sh for builds using O=
      
         - Remove pr_err() calls from fpu_emu() for a case which is not a
           kernel error"
      
      * '4.14-fixes' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: math-emu: Remove pr_err() calls from fpu_emu()
        MIPS: Fix generic-board-config.sh for builds using O=
        MIPS: Fix cmpxchg on 32b signed ints for 64b kernel with !kernel_uses_llsc
        MIPS: loongson1: set default number of rx and tx queues for stmmac
        MIPS: bpf: Fix uninitialised target compiler error
      be1f16ba
    • Andy Lutomirski's avatar
      x86/mm: Flush more aggressively in lazy TLB mode · b956575b
      Andy Lutomirski authored
      Since commit:
      
        94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      
      x86's lazy TLB mode has been all the way lazy: when running a kernel thread
      (including the idle thread), the kernel keeps using the last user mm's
      page tables without attempting to maintain user TLB coherence at all.
      
      From a pure semantic perspective, this is fine -- kernel threads won't
      attempt to access user pages, so having stale TLB entries doesn't matter.
      
      Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
      we also allow any paging-structure caches that may exist on the CPU
      to become incoherent.  This means that we can have a
      paging-structure cache entry that references a freed page table, and
      the CPU is within its rights to do a speculative page walk starting
      at the freed page table.
      
      I can imagine this causing two different problems:
      
       - A speculative page walk starting from a bogus page table could read
         IO addresses.  I haven't seen any reports of this causing problems.
      
       - A speculative page walk that involves a bogus page table can install
         garbage in the TLB.  Such garbage would always be at a user VA, but
         some AMD CPUs have logic that triggers a machine check when it notices
         these bogus entries.  I've seen a couple reports of this.
      
      Boris further explains the failure mode:
      
      > It is actually more of an optimization which assumes that paging-structure
      > entries are in WB DRAM:
      >
      > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
      > performance optimization that assumes PML4, PDP, PDE, and PTE entries
      > are in cacheable WB-DRAM; memory type checks may be bypassed, and
      > addresses outside of WB-DRAM may result in undefined behavior or NB
      > protocol errors. 1=Disables performance optimization and allows PML4,
      > PDP, PDE and PTE entries to be in any memory type. Operating systems
      > that maintain page tables in memory types other than WB- DRAM must set
      > TlbCacheDis to insure proper operation."
      >
      > The MCE generated is an NB protocol error to signal that
      >
      > "Link: A specific coherent-only packet from a CPU was issued to an
      > IO link. This may be caused by software which addresses page table
      > structures in a memory type other than cacheable WB-DRAM without
      > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
      > example, when page table structure addresses are above top of memory. In
      > such cases, the NB will generate an MCE if it sees a mismatch between
      > the memory operation generated by the core and the link type."
      >
      > I'm assuming coherent-only packets don't go out on IO links, thus the
      > error.
      
      To fix this, reinstate TLB coherence in lazy mode.  With this patch
      applied, we do it in one of two ways:
      
       - If we have PCID, we simply switch back to init_mm's page tables
         when we enter a kernel thread -- this seems to be quite cheap
         except for the cost of serializing the CPU.
      
       - If we don't have PCID, then we set a flag and switch to init_mm
         the first time we would otherwise need to flush the TLB.
      
      The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
      to override the default mode for benchmarking.
      
      In theory, we could optimize this better by only flushing the TLB in
      lazy CPUs when a page table is freed.  Doing that would require
      auditing the mm code to make sure that all page table freeing goes
      through tlb_remove_page() as well as reworking some data structures
      to implement the improved flush logic.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b956575b
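The two lazy-mode strategies described in the x86/mm commit above (switch to init_mm right away when PCID is available, otherwise set a flag and switch at the first flush request) can be sketched as a toy state machine. This is a simplified userspace model under stated assumptions, not the kernel implementation: `struct cpu_state`, `enum mm_id`, `enter_lazy_tlb_model()` and `handle_flush_request()` are hypothetical names, and real TLB and PCID behaviour is reduced to two fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for whose page tables a CPU has loaded. */
enum mm_id { MM_INIT, MM_USER };

/* Hypothetical per-CPU state for this model. */
struct cpu_state {
	bool has_pcid;         /* CPU supports PCID */
	enum mm_id loaded_mm;  /* page tables currently loaded */
	bool is_lazy;          /* switch to init_mm has been deferred */
};

/* Called when the CPU starts running a kernel thread. */
void enter_lazy_tlb_model(struct cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->has_pcid) {
		/* With PCID: switch to init_mm's page tables immediately. */
		cpu->loaded_mm = MM_INIT;
		cpu->is_lazy = false;
	} else {
		/* Without PCID: keep the user page tables, remember we are lazy. */
		cpu->is_lazy = true;
	}
}

/*
 * Called when a flush for the user mm reaches this CPU.
 * Returns true if the model performs a flush, false if none is done.
 */
bool handle_flush_request(struct cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->is_lazy) {
		/* First flush while lazy: leave the user mm instead of flushing. */
		cpu->loaded_mm = MM_INIT;
		cpu->is_lazy = false;
		return false;
	}

	/* A flush is only needed while the user page tables are still loaded. */
	return cpu->loaded_mm == MM_USER;
}
```

In both paths the CPU ends up on init_mm's page tables before stale paging-structure entries for the user mm can matter; the difference is only whether the switch happens on entry to the kernel thread or at the first flush request.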
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-for-v4.14-rc5' of git://people.freedesktop.org/~airlied/linux · 9aa0d2dd
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Couple of the arm people seem to wake up so this has imx and msm
        fixes, along with a bunch of i915 stable bounds fixes and an amdgpu
        regression fix.
      
        All seems pretty okay for now"
      
      * tag 'drm-fixes-for-v4.14-rc5' of git://people.freedesktop.org/~airlied/linux:
        drm/msm: fix _NO_IMPLICIT fencing case
        drm/msm: fix error path cleanup
        drm/msm/mdp5: Remove extra pm_runtime_put call in mdp5_crtc_cursor_set()
        drm/msm/dsi: Use correct pm_runtime_put variant during host_init
        drm/msm: fix return value check in _msm_gem_kernel_new()
        drm/msm: use proper memory barriers for updating tail/head
        drm/msm/mdp5: add missing max size for 8x74 v1
        drm/amdgpu: fix placement flags in amdgpu_ttm_bind
        drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
        gpu: ipu-v3: pre: implement workaround for ERR009624
        gpu: ipu-v3: prg: wait for double buffers to be filled on channel startup
        gpu: ipu-v3: Allow channel burst locking on i.MX6 only
        drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
        drm/i915: Order two completing nop_submit_request
        drm/i915: Silence compiler warning for hsw_power_well_enable()
        drm/i915: Use crtc_state_is_legacy_gamma in intel_color_check
        drm/i915/edp: Increase the T12 delay quirk to 1300ms
        drm/i915/edp: Get the Panel Power Off timestamp after panel is off
        sync_file: Return consistent status in SYNC_IOC_FILE_INFO
        drm/atomic: Unref duplicated drm_atomic_state in drm_atomic_helper_resume()
      9aa0d2dd