1. 15 Jan, 2018 22 commits
    • Trond Myklebust's avatar
      SUNRPC: Chunk reading of replies from the server · 3d188805
      Trond Myklebust authored
      Read the TCP data in chunks of max 2MB so that we do not hog the
      socket lock.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      3d188805
    • Chuck Lever's avatar
      SUNRPC: Remove rpc_protocol() · 024fbf9c
      Chuck Lever authored
      Since nfs4_create_referral_server was the only call site of
      rpc_protocol, rpc_protocol can now be removed.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      024fbf9c
    • Chuck Lever's avatar
      nfs: Update server port after referral or migration · 801b5643
      Chuck Lever authored
      After traversing a referral or recovering from a migration event,
      ensure that the server port reported in /proc/mounts is updated
      to the correct port setting for the new submount.
      Reported-by: default avatarHelen Chao <helen.chao@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      801b5643
    • Chuck Lever's avatar
      nfs: Referrals should use the same proto setting as their parent · 530ea421
      Chuck Lever authored
      Helen Chao <helen.chao@oracle.com> noticed that when a user
      traverses a referral on an NFS/RDMA mount, the resulting submount
      always uses TCP.
      
      This behavior does not match the vers= setting when traversing
      a referral (vers=4.1 is preserved). It also does not match the
      behavior of crossing from the pseudofs into a real filesystem
      (proto=rdma is preserved in that case).
      
      The Linux NFS client does not currently support the
      fs_locations_info attribute. The situation is similar for all
      NFSv4 servers I know of. Therefore until the community has broad
      support for fs_locations_info, when following a referral:
      
       - First try to connect with RPC-over-RDMA. This will fail quickly
         if the client has no RDMA-capable interfaces.
      
       - If connecting with RPC-over-RDMA fails, or the RPC-over-RDMA
         transport is not available, use TCP.
      Reported-by: default avatarHelen Chao <helen.chao@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      530ea421
    • Chuck Lever's avatar
      nfs: Define NFS_RDMA_PORT · fb455baa
      Chuck Lever authored
      The NFS/RDMA port assignment is specified in Section 9 of RFC 8267.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      fb455baa
    • Elena Reshetova's avatar
      lockd: convert nlm_rqst.a_count from atomic_t to refcount_t · fbca30c5
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable nlm_rqst.a_count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      **Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the nlm_rqst.a_count it might make a difference
      in following places:
       - nlmclnt_release_call() and nlmsvc_release_call(): decrement
         in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: default avatarHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      fbca30c5
    • Elena Reshetova's avatar
      lockd: convert nlm_lockowner.count from atomic_t to refcount_t · 431f125b
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable nlm_lockowner.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      **Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the nlm_lockowner.count it might make a difference
      in following places:
       - nlm_put_lockowner(): decrement in refcount_dec_and_lock() only
         provides RELEASE ordering, control dependency on success and
         holds a spin lock on success vs. fully ordered atomic counterpart.
         No changes in spin lock guarantees.
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: default avatarHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      431f125b
    • Elena Reshetova's avatar
      lockd: convert nsm_handle.sm_count from atomic_t to refcount_t · c751082c
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable nsm_handle.sm_count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      **Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the nsm_handle.sm_count it might make a difference
      in following places:
       - nsm_release(): decrement in refcount_dec_and_lock() only
         provides RELEASE ordering, control dependency on success
         and holds a spin lock on success vs. fully ordered atomic
         counterpart. No change for the spin lock guarantees.
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: default avatarHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      c751082c
    • Elena Reshetova's avatar
      lockd: convert nlm_host.h_count from atomic_t to refcount_t · fee21fb5
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable nlm_host.h_count  is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      **Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the nlm_host.h_count it might make a difference
      in following places:
       - nlmsvc_release_host(): decrement in refcount_dec()
         provides RELEASE ordering, while original atomic_dec()
         was fully unordered. Since the change is for better, it
         should not matter.
       - nlmclnt_release_host(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart. It doesn't seem to
         matter in this case since object freeing happens under mutex
         lock anyway.
      Suggested-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: default avatarHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: default avatarElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      fee21fb5
    • Scott Mayhew's avatar
      nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds · ba4a76f7
      Scott Mayhew authored
      Currently when falling back to doing I/O through the MDS (via
      pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
      without releasing the reference taken on the dreq
      via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
      nfs_direct_pgio_init.  It then takes another reference on the dreq via
      nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
      as a result the requester will become stuck in inode_dio_wait.  Once
      that happens, other processes accessing the inode will become stuck as
      well.
      
      Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
      up correctly by calling hdr->completion_ops->completion() instead of
      calling hdr->release() directly.
      
      This can be reproduced (sometimes) by performing "storage failover
      takeover" commands on NetApp filer while doing direct I/O from a client.
      
      This can also be reproduced using SystemTap to simulate a failure while
      doing direct I/O from a client (from Dave Wysochanski
      <dwysocha@redhat.com>):
      
      stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'
      Suggested-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarScott Mayhew <smayhew@redhat.com>
      Fixes: 1ca018d2 ("pNFS: Fix a memory leak when attempted pnfs fails")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      ba4a76f7
    • Benjamin Coddington's avatar
      pnfs/blocklayout: handle transient devices · b3dce6a2
      Benjamin Coddington authored
      PNFS block/SCSI layouts should gracefully handle cases where block devices
      are not available when a layout is retrieved, or the block devices are
      removed while the client holds a layout.
      
      While setting up a layout segment, keep a record of an unavailable or
      un-parsable block device in cache with a flag so that subsequent layouts do
      not spam the server with GETDEVINFO.  We can reuse the current
      NFS_DEVICEID_UNAVAILABLE handling with one variation: instead of reusing
      the device, we will discard it and send a fresh GETDEVINFO after the
      timeout, since the lookup and validation of the device occurs within the
      GETDEVINFO response handling.
      
      A lookup of a layout segment that references an unavailable device will
      return a segment with the NFS_LSEG_UNAVAILABLE flag set.  This will allow
      the pgio layer to mark the layout with the appropriate fail bit, which
      forces subsequent IO to the MDS, and prevents spamming the server with
      LAYOUTGET, LAYOUTRETURN.
      
      Finally, when IO to a block device fails, look up the block device(s)
      referenced by the pgio header, and mark them as unavailable.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      b3dce6a2
    • Benjamin Coddington's avatar
      pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR · d78471d3
      Benjamin Coddington authored
      If there's an error doing I/O to block device, and the client resends the
      I/O to the MDS, the MDS must recall the layout from the client before
      processing the I/O.  Let's preempt that exchange by returning the layout
      before falling back to the MDS when there's an error.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      d78471d3
    • Benjamin Coddington's avatar
      pnfs/blocklayout: Add module alias for LAYOUT4_SCSI · ad6b0241
      Benjamin Coddington authored
      The blocklayout module contains the client support for both block and SCSI
      layouts.  Add a module alias for the SCSI layout type so that the module
      will be loaded for SCSI layouts.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      ad6b0241
    • Benjamin Coddington's avatar
      NFS: remove unused offset arg in nfs_pgio_rpcsetup · e545735a
      Benjamin Coddington authored
      nfs_pgio_rpcsetup() is always called with an offset of 0, so we should be
      able to drop the arguement altogether.
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      e545735a
    • NeilBrown's avatar
      NFSv4: always set NFS_LOCK_LOST when a lock is lost. · dce2630c
      NeilBrown authored
      There are 2 comments in the NFSv4 code which suggest that
      SIGLOST should possibly be sent to a process.  In these
      cases a lock has been lost.
      The current practice is to set NFS_LOCK_LOST so that
      read/write returns EIO when a lock is lost.
      So change these comments to code when sets NFS_LOCK_LOST.
      
      One case is when lock recovery after apparent server restart
      fails with NFS4ERR_DENIED, NFS4ERR_RECLAIM_BAD, or
      NFS4ERRO_RECLAIM_CONFLICT.  The other case is when a lock
      attempt as part of lease recovery fails with NFS4ERR_DENIED.
      
      In an ideal world, these should not happen.  However I have
      a packet trace showing an NFSv4.1 session getting
      NFS4ERR_BADSESSION after an extended network parition.  The
      NFSv4.1 client treats this like server reboot until/unless
      it get NFS4ERR_NO_GRACE, in which case it switches over to
      "nograce" recovery mode.  In this network trace, the client
      attempts to recover a lock and the server (incorrectly)
      reports NFS4ERR_DENIED rather than NFS4ERR_NO_GRACE.  This
      leads to the ineffective comment and the client then
      continues to write using the OPEN stateid.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      dce2630c
    • NeilBrown's avatar
      nfs: remove dead code from nfs_encode_fh() · aaa15008
      NeilBrown authored
      This code can never be used as the IS_AUTOMOUNT(inode)
      case has already been handled.
      So remove it to avoid confusion.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      aaa15008
    • Trond Myklebust's avatar
      Support statx() mask and query flags parameters · 9ccee940
      Trond Myklebust authored
      Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
      revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
      only.
      
      Use the mask to optimise away server revalidation for attributes
      that are not being requested by the user.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      9ccee940
    • Trond Myklebust's avatar
      NFS: Fix nfsstat breakage due to LOOKUPP · 8634ef5e
      Trond Myklebust authored
      The LOOKUPP operation was inserted into the nfs4_procedures array
      rather than being appended, which put /proc/net/rpc/nfs out of
      whack, and broke the nfsstat utility.
      Fix by moving the LOOKUPP operation to the end of the array, and
      by ensuring that it keeps the same length whether or not NFSV4.1
      and NFSv4.2 are compiled in.
      
      Fixes: 5b5faaf6 ("nfs4: add NFSv4 LOOKUPP handlers")
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      8634ef5e
    • Trond Myklebust's avatar
      NFSv4: Convert LOCKU to use nfs4_async_handle_exception() · 82571552
      Trond Myklebust authored
      Convert CLOSE so that it specifies the correct stateid and
      inode for the error handling.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      82571552
    • Trond Myklebust's avatar
    • Trond Myklebust's avatar
      NFSv4: Convert CLOSE to use nfs4_async_handle_exception() · b8b8d221
      Trond Myklebust authored
      Convert CLOSE so that it specifies the correct stateid, state and
      inode for the error handling.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      b8b8d221
    • Trond Myklebust's avatar
      NFS: Add a cond_resched() to nfs_commit_release_pages() · 7f1bda44
      Trond Myklebust authored
      The commit list can get very large, and so we need a cond_resched()
      in nfs_commit_release_pages() in order to ensure we don't hog the CPU
      for excessive periods of time.
      Reported-by: default avatarMike Galbraith <efault@gmx.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      7f1bda44
  2. 14 Jan, 2018 9 commits
    • Linus Torvalds's avatar
      Linux 4.15-rc8 · a8750ddc
      Linus Torvalds authored
      a8750ddc
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · aaae98a8
      Linus Torvalds authored
      Pull x86 fixlet from Thomas Gleixner.
      
      Remove a warning about lack of compiler support for retpoline that most
      people can't do anything about, so it just annoys them needlessly.
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/retpoline: Remove compile time warning
      aaae98a8
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 6bb82119
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "One fix for an oops at boot if we take a hotplug interrupt before we
        are ready to handle it.
      
        The bulk is patches to implement mitigation for Meltdown, see the
        change logs for more details.
      
        Thanks to: Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon
        Masters, Jose Ricardo Ziviani, David Gibson"
      
      * tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Check device-tree for RFI flush settings
        powerpc/pseries: Query hypervisor for RFI flush settings
        powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
        powerpc/64s: Add support for RFI flush of L1-D cache
        powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
        powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
        powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
        powerpc/64s: Simple RFI macro conversions
        powerpc/64: Add macros for annotating the destination of rfid/hrfid
        powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
        powerpc/pseries: Make RAS IRQ explicitly dependent on DLPAR WQ
      6bb82119
    • Thomas Gleixner's avatar
      x86/retpoline: Remove compile time warning · b8b9ce4b
      Thomas Gleixner authored
      Remove the compile time warning when CONFIG_RETPOLINE=y and the compiler
      does not have retpoline support. Linus rationale for this is:
      
        It's wrong because it will just make people turn off RETPOLINE, and the
        asm updates - and return stack clearing - that are independent of the
        compiler are likely the most important parts because they are likely the
        ones easiest to target.
      
        And it's annoying because most people won't be able to do anything about
        it. The number of people building their own compiler? Very small. So if
        their distro hasn't got a compiler yet (and pretty much nobody does), the
        warning is just annoying crap.
      
        It is already properly reported as part of the sysfs interface. The
        compile-time warning only encourages bad things.
      
      Fixes: 76b04384 ("x86/retpoline: Add initial retpoline support")
      Requested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: gnomes@lxorguk.ukuu.org.uk
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: thomas.lendacky@amd.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
      Link: https://lkml.kernel.org/r/CA+55aFzWgquv4i6Mab6bASqYXg3ErV3XDFEYf=GEcCDQg5uAtw@mail.gmail.com
      b8b9ce4b
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 9443c168
      Linus Torvalds authored
      Pull NVMe fix from Jens Axboe:
       "Just a single fix for nvme over fabrics that should go into 4.15"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        nvme-fabrics: initialize default host->id in nvmf_host_default()
      9443c168
    • Linus Torvalds's avatar
      Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 40548c6b
      Linus Torvalds authored
      Pull x86 pti updates from Thomas Gleixner:
       "This contains:
      
         - a PTI bugfix to avoid setting reserved CR3 bits when PCID is
           disabled. This seems to cause issues on a virtual machine at least
           and is incorrect according to the AMD manual.
      
         - a PTI bugfix which disables the perf BTS facility if PTI is
           enabled. The BTS AUX buffer is not globally visible and causes the
           CPU to fault when the mapping disappears on switching CR3 to user
           space. A full fix which restores BTS on PTI is non trivial and will
           be worked on.
      
         - PTI bugfixes for EFI and trusted boot which make sure that the user
           space visible page table entries have the NX bit cleared
      
         - removal of dead code in the PTI pagetable setup functions
      
         - add PTI documentation
      
         - add a selftest for vsyscall to verify that the kernel actually
           implements what it advertises.
      
         - a sysfs interface to expose vulnerability and mitigation
           information so there is a coherent way for users to retrieve the
           status.
      
         - the initial spectre_v2 mitigations, aka retpoline:
      
            + The necessary ASM thunk and compiler support
      
            + The ASM variants of retpoline and the conversion of affected ASM
              code
      
            + Make LFENCE serializing on AMD so it can be used as speculation
              trap
      
            + The RSB fill after vmexit
      
         - initial objtool support for retpoline
      
        As I said in the status mail this is the most of the set of patches
        which should go into 4.15 except two straight forward patches still on
        hold:
      
         - the retpoline add on of LFENCE which waits for ACKs
      
         - the RSB fill after context switch
      
        Both should be ready to go early next week and with that we'll have
        covered the major holes of spectre_v2 and go back to normality"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
        x86,perf: Disable intel_bts when PTI
        security/Kconfig: Correct the Documentation reference for PTI
        x86/pti: Fix !PCID and sanitize defines
        selftests/x86: Add test_vsyscall
        x86/retpoline: Fill return stack buffer on vmexit
        x86/retpoline/irq32: Convert assembler indirect jumps
        x86/retpoline/checksum32: Convert assembler indirect jumps
        x86/retpoline/xen: Convert Xen hypercall indirect jumps
        x86/retpoline/hyperv: Convert assembler indirect jumps
        x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
        x86/retpoline/entry: Convert entry assembler indirect jumps
        x86/retpoline/crypto: Convert crypto assembler indirect jumps
        x86/spectre: Add boot time option to select Spectre v2 mitigation
        x86/retpoline: Add initial retpoline support
        objtool: Allow alternatives to be ignored
        objtool: Detect jumps to retpoline thunks
        x86/pti: Make unpoison of pgd for trusted boot work for real
        x86/alternatives: Fix optimize_nops() checking
        sysfs/cpu: Fix typos in vulnerability documentation
        x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
        ...
      40548c6b
    • Peter Zijlstra's avatar
      x86,perf: Disable intel_bts when PTI · 99a9dc98
      Peter Zijlstra authored
      The intel_bts driver does not use the 'normal' BTS buffer which is exposed
      through the cpu_entry_area but instead uses the memory allocated for the
      perf AUX buffer.
      
      This obviously comes apart when using PTI because then the kernel mapping;
      which includes that AUX buffer memory; disappears. Fixing this requires to
      expose a mapping which is visible in all context and that's not trivial.
      
      As a quick fix disable this driver when PTI is enabled to prevent
      malfunction.
      
      Fixes: 385ce0ea ("x86/mm/pti: Add Kconfig")
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Reported-by: default avatarRobert Święcki <robert@swiecki.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: greg@kroah.com
      Cc: hughd@google.com
      Cc: luto@amacapital.net
      Cc: Vince Weaver <vince@deater.net>
      Cc: torvalds@linux-foundation.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180114102713.GB6166@worktop.programming.kicks-ass.net
      99a9dc98
    • W. Trevor King's avatar
      security/Kconfig: Correct the Documentation reference for PTI · a237f762
      W. Trevor King authored
      When the config option for PTI was added a reference to documentation was
      added as well. But the documentation did not exist at that point. The final
      documentation has a different file name.
      
      Fix it up to point to the proper file.
      
      Fixes: 385ce0ea ("x86/mm/pti: Add Kconfig")
      Signed-off-by: default avatarW. Trevor King <wking@tremily.us>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-security-module@vger.kernel.org
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/3009cc8ccbddcd897ec1e0cb6dda524929de0d14.1515799398.git.wking@tremily.us
      a237f762
    • Thomas Gleixner's avatar
      x86/pti: Fix !PCID and sanitize defines · f10ee3dc
      Thomas Gleixner authored
      The switch to the user space page tables in the low level ASM code sets
      unconditionally bit 12 and bit 11 of CR3. Bit 12 is switching the base
      address of the page directory to the user part, bit 11 is switching the
      PCID to the PCID associated with the user page tables.
      
      This fails on a machine which lacks PCID support because bit 11 is set in
      CR3. Bit 11 is reserved when PCID is inactive.
      
      While the Intel SDM claims that the reserved bits are ignored when PCID is
      disabled, the AMD APM states that they should be cleared.
      
      This went unnoticed as the AMD APM was not checked when the code was
      developed and reviewed and test systems with Intel CPUs never failed to
      boot. The report is against a Centos 6 host where the guest fails to boot,
      so it's not yet clear whether this is a virt issue or can happen on real
      hardware too, but thats irrelevant as the AMD APM clearly ask for clearing
      the reserved bits.
      
      Make sure that on non PCID machines bit 11 is not set by the page table
      switching code.
      
      Andy suggested to rename the related bits and masks so they are clearly
      describing what they should be used for, which is done as well for clarity.
      
      That split could have been done with alternatives but the macro hell is
      horrible and ugly. This can be done on top if someone cares to remove the
      extra orq. For now it's a straight forward fix.
      
      Fixes: 6fd166aa ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
      Reported-by: default avatarLaura Abbott <labbott@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801140009150.2371@nanos
      f10ee3dc
  3. 13 Jan, 2018 9 commits