1. 17 Apr, 2020 40 commits
    • crypto: mxs-dcp - fix scatterlist linearization for hash · c127f180
      Rosioru Dragos authored
      commit fa03481b upstream.
      
      The incorrect traversal of the scatterlist during the linearization phase
      led to computing the hash value of the wrong input buffer.
      The new implementation uses scatterwalk_map_and_copy()
      to address this issue.
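      For readers unfamiliar with the failure mode, the C sketch below shows what a correct linearization has to do: walk the source segments in order, honour a starting offset, and advance the destination independently of each segment. `struct seg` and `linearize()` are hypothetical stand-ins for a scatterlist and for the effect of scatterwalk_map_and_copy(); this is not the driver's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist entry: a contiguous chunk. */
struct seg {
	const uint8_t *buf;
	size_t len;
};

/*
 * Copy 'len' bytes, starting at byte offset 'skip' of the logical buffer
 * described by sg[0..nents), into 'dst'. The destination offset ('done')
 * advances independently of the position inside each source segment.
 */
static size_t linearize(const struct seg *sg, size_t nents,
			size_t skip, uint8_t *dst, size_t len)
{
	size_t done = 0;

	for (size_t i = 0; i < nents && done < len; i++) {
		size_t seglen = sg[i].len;

		if (skip >= seglen) {	/* segment lies entirely before the start */
			skip -= seglen;
			continue;
		}
		size_t n = seglen - skip;
		if (n > len - done)
			n = len - done;
		memcpy(dst + done, sg[i].buf + skip, n);
		done += n;
		skip = 0;		/* only the first copied segment is offset */
	}
	return done;
}
```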
      
      Cc: <stable@vger.kernel.org>
      Fixes: 15b59e7c ("crypto: mxs - Add Freescale MXS DCP driver")
      Signed-off-by: Rosioru Dragos <dragos.rosioru@nxp.com>
      Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fix missing semaphore unlock in btrfs_sync_file · ed870340
      Robbie Ko authored
      commit 6ff06729 upstream.
      
      Ordered ops are started twice in sync file, once outside of inode mutex
      and once inside, taking the dio semaphore. There was one error path
      missing the semaphore unlock.
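      A minimal sketch of the single-exit error-handling pattern that prevents this class of bug, assuming a pthread mutex as a stand-in for the dio semaphore. `sync_file_sketch()` and `start_ordered_ops()` are hypothetical names, not btrfs functions.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the inode's dio semaphore. */
static pthread_mutex_t dio_sem = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical step that may fail while the lock is held. */
static int start_ordered_ops(int fail)
{
	return fail ? -EIO : 0;
}

/*
 * Every path taken after acquiring the lock funnels through 'out_unlock',
 * so no error path can return with the lock still held.
 */
static int sync_file_sketch(int fail)
{
	int ret;

	pthread_mutex_lock(&dio_sem);
	ret = start_ordered_ops(fail);
	if (ret)
		goto out_unlock;	/* a bare 'return ret' here would leak the lock */
	/* ... log the inode, etc. ... */
out_unlock:
	pthread_mutex_unlock(&dio_sem);
	return ret;
}
```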
      
      Fixes: aab15e8e ("Btrfs: fix rare chances for data loss when doing a fast fsync")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [ add changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fix missing file extent item for hole after ranged fsync · 867ae5eb
      Filipe Manana authored
      commit 95418ed1 upstream.
      
      When doing a fast fsync for a range that starts at an offset greater than
      zero, we can end up with a log that, when replayed, causes the respective
      inode to miss a file extent item representing a hole if we are not using the
      NO_HOLES feature. This is because for fast fsyncs we don't log any extents
      that cover a range different from the one requested in the fsync.
      
      Example scenario to trigger it:
      
        $ mkfs.btrfs -O ^no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create a file with a single 256K write and fsync it to clear the full sync
        # bit in the inode - we want the msync below to trigger a fast fsync.
        $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
      
        # Force a transaction commit and wipe out the log tree.
        $ sync
      
        # Dirty 768K of data, increasing the file size to 1M, and flush only
        # the range from 256K to 512K without updating the log tree
        # (sync_file_range() does not trigger fsync, it only starts writeback
        # and waits for it to finish).
      
        $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
        $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
      
        # Now dirty the range from 768K to 1M again and sync that range.
        $ xfs_io -c "mmap -w 768K 256K"        \
                 -c "mwrite -S 0xef 768K 256K" \
                 -c "msync -s 768K 256K"       \
                 -c "munmap"                   \
                 /mnt/foo
      
        <power fail>
      
        # Mount to replay the log.
        $ mount /dev/sdd /mnt
        $ umount /mnt
      
        $ btrfs check /dev/sdd
        Opening filesystem to check...
        Checking filesystem on /dev/sdd
        UUID: 482fb574-b288-478e-a190-a9c44a78fca6
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
             start: 262144, len: 524288
        ERROR: errors found in fs roots
        found 720896 bytes used, error(s) found
        total csum bytes: 512
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123514
        file data blocks allocated: 589824
          referenced 589824
      
      Fix this issue by setting the range to full (0 to LLONG_MAX) when the
      NO_HOLES feature is not enabled. This results in extra work being done
      but it gives the guarantee we don't end up with missing holes after
      replaying the log.
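      The fix amounts to widening the requested range whenever NO_HOLES is off. A hypothetical helper illustrating that decision follows; the struct and function names are illustrative only, and `end` is assumed to be inclusive, as in the VFS fsync range.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical fsync range; 'end' is assumed inclusive. */
struct fsync_range {
	long long start;
	long long end;
};

/*
 * Without NO_HOLES, holes are represented by explicit file extent items,
 * so a ranged fast fsync could omit one lying outside the requested range.
 * Widening to the whole file avoids that, at the cost of logging more.
 */
static struct fsync_range effective_range(struct fsync_range req, bool no_holes)
{
	if (!no_holes) {
		req.start = 0;
		req.end = LLONG_MAX;
	}
	return req;
}
```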
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: drop block from cache on error in relocation · d8ecdce1
      Josef Bacik authored
      commit 8e19c973 upstream.
      
      If we have an error while building the backref tree in relocation we'll
      process all the pending edges and then free the node.  However if we
      integrated some edges into the cache we'll lose our link to those edges
      by simply freeing this node, which means we'll leak memory and
      references to any roots that we've found.
      
      Instead we need to use remove_backref_node(), which walks through all of
      the edges that are still linked to this node, frees them, and
      drops any root references we may be holding.
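      A simplified model of why freeing only the node leaks: each edge still linked to it holds a root reference. The types and `remove_node()` below are hypothetical and only mirror the idea behind remove_backref_node().

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified backref cache types. */
struct root {
	int refs;
};

struct edge {
	struct edge *next;
	struct root *root;	/* reference taken when the edge was linked */
};

struct node {
	struct edge *edges;	/* edges already integrated into the cache */
};

/*
 * Release every edge still linked to the node and drop the root reference
 * each one holds, then free the node. free(n) alone would leak both.
 */
static void remove_node(struct node *n)
{
	struct edge *e = n->edges;

	while (e) {
		struct edge *next = e->next;

		e->root->refs--;
		free(e);
		e = next;
	}
	free(n);
}
```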
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: set update the uuid generation as soon as possible · d3a7c4b8
      Josef Bacik authored
      commit 75ec1db8 upstream.
      
      In my EIO stress testing I noticed I was getting forced to rescan the
      uuid tree pretty often, which was weird.  This is because my error
      injection stuff would sometimes inject an error after log replay but
      before we loaded the UUID tree.  If log replay committed the transaction
      it wouldn't have updated the uuid tree generation, but the tree was
      valid and didn't change, so there's no reason to not update the
      generation here.
      
      Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
      after reading all the fs roots if the uuid tree generation matches the
      fs generation.  Then any transaction commits that happen during mount
      won't screw up our uuid tree state, forcing us to do needless uuid
      rescans.
      
      Fixes: 70f80175 ("Btrfs: check UUID tree during mount if required")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix crash during unmount due to race with delayed inode workers · 7ed0c4db
      Filipe Manana authored
      commit f0cc2cd7 upstream.
      
      During unmount we can have a job from the delayed inode items work queue
      still running, that can lead to at least two bad things:
      
      1) A crash, because the worker can try to create a transaction just
         after the fs roots were freed;
      
      2) A transaction leak, because the worker can create a transaction
         before the fs roots are freed and just after we committed the last
         transaction and after we stopped the transaction kthread.
      
      A stack trace example of the crash:
      
       [79011.691214] kernel BUG at lib/radix-tree.c:982!
       [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G        W         5.6.0-rc2-btrfs-next-54 #2
       (...)
       [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
       [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
       (...)
       [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
       [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
       [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
       [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
       [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
       [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
       [79011.709270] FS:  0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
       [79011.710699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
       [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [79011.715448] Call Trace:
       [79011.715925]  record_root_in_trans+0x72/0xf0 [btrfs]
       [79011.716819]  btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
       [79011.717925]  start_transaction+0xdd/0x5c0 [btrfs]
       [79011.718829]  btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
       [79011.719915]  btrfs_work_helper+0xaa/0x720 [btrfs]
       [79011.720773]  process_one_work+0x26d/0x6a0
       [79011.721497]  worker_thread+0x4f/0x3e0
       [79011.722153]  ? process_one_work+0x6a0/0x6a0
       [79011.722901]  kthread+0x103/0x140
       [79011.723481]  ? kthread_create_worker_on_cpu+0x70/0x70
       [79011.724379]  ret_from_fork+0x3a/0x50
       (...)
      
      The following diagram shows a sequence of steps that lead to the crash
      during unmount of the filesystem:
      
              CPU 1                                             CPU 2                                CPU 3
      
       btrfs_punch_hole()
         btrfs_btree_balance_dirty()
           btrfs_balance_delayed_items()
             --> sees
                 fs_info->delayed_root->items
                 with value 200, which is greater
                 than
                 BTRFS_DELAYED_BACKGROUND (128)
                 and smaller than
                 BTRFS_DELAYED_WRITEBACK (512)
             btrfs_wq_run_delayed_node()
               --> queues a job for
                   fs_info->delayed_workers to run
                   btrfs_async_run_delayed_root()
      
                                                                                                  btrfs_async_run_delayed_root()
                                                                                                    --> job queued by CPU 1
      
                                                                                                    --> starts picking and running
                                                                                                        delayed nodes from the
                                                                                                        prepare_list list
      
                                                       close_ctree()
      
                                                         btrfs_delete_unused_bgs()
      
                                                         btrfs_commit_super()
      
                                                           btrfs_join_transaction()
                                                             --> gets transaction N
      
                                                           btrfs_commit_transaction(N)
                                                             --> set transaction state
                                                               to TRANS_STATE_COMMIT_START
      
                                                                                                   btrfs_first_prepared_delayed_node()
                                                                                                     --> picks delayed node X through
                                                                                                         the prepared_list list
      
                                                             btrfs_run_delayed_items()
      
                                                               btrfs_first_delayed_node()
                                                                 --> also picks delayed node X
                                                                     but through the node_list
                                                                     list
      
                                                               __btrfs_commit_inode_delayed_items()
                                                                  --> runs all delayed items from
                                                                      this node and drops the
                                                                      node's item count to 0
                                                                      through call to
                                                                      btrfs_release_delayed_inode()
      
                                                               --> finishes running any remaining
                                                                   delayed nodes
      
                                                             --> finishes transaction commit
      
                                                         --> stops cleaner and transaction threads
      
                                                         btrfs_free_fs_roots()
                                                           --> frees all roots and removes them
                                                               from the radix tree
                                                               fs_info->fs_roots_radix
      
                                                                                                   btrfs_join_transaction()
                                                                                                     start_transaction()
                                                                                                       btrfs_record_root_in_trans()
                                                                                                         record_root_in_trans()
                                                                                                           radix_tree_tag_set()
                                                                                                             --> crashes because
                                                                                                                 the root is not in
                                                                                                                 the radix tree
                                                                                                                 anymore
      
      If the worker is able to call btrfs_join_transaction() before the unmount
      task frees the fs roots, we end up leaking a transaction and all its
      resources, since after the call to btrfs_commit_super() and stopping the
      transaction kthread, we don't expect to have any transaction open anymore.
      
      When this situation happens the worker has a delayed node that has no
      more items to run, since the task calling btrfs_run_delayed_items(),
      which is doing a transaction commit, picks the same node and runs all
      its items first.
      
      We cannot wait for the worker to complete when running delayed items
      through btrfs_run_delayed_items(), because we call that function in
      several phases of a transaction commit, and that could cause a deadlock
      because the worker calls btrfs_join_transaction() and the task doing the
      transaction commit may have already set the transaction state to
      TRANS_STATE_COMMIT_DOING.
      
      Also it's not possible to get into a situation where only some of the
      items of a delayed node are added to the fs/subvolume tree in the current
      transaction and the remaining ones in the next transaction, because when
      running the items of a delayed inode we lock its mutex, effectively
      waiting for the worker if the worker is running the items of the delayed
      node already.
      
      Since this can only cause issues when unmounting a filesystem, fix it in
      a simple way by waiting for any jobs on the delayed workers queue before
      calling btrfs_commit_super() at close_ctree(). This works because at this
      point no one can call btrfs_btree_balance_dirty() or
      btrfs_balance_delayed_items(), and if we end up waiting for any worker to
      complete, btrfs_commit_super() will commit the transaction created by the
      worker.
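      The ordering the fix relies on (drain the delayed workers, commit, then free the roots) can be modelled in user space with pthreads. Everything below is a hypothetical sketch: `flush_wq()` plays the role of waiting for jobs on the delayed workers queue, and the `roots_valid` flag stands in for the fs roots radix tree.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical work queue: a lock, an "idle" condition, a pending count. */
struct wq {
	pthread_mutex_t lock;
	pthread_cond_t idle;
	int pending;
};

static struct wq delayed_workers = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};
static bool roots_valid = true;		/* stands in for fs_roots_radix */
static int jobs_seen_freed_roots;	/* each one would have been a crash */

static void *delayed_job(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&delayed_workers.lock);
	if (!roots_valid)
		jobs_seen_freed_roots++;
	delayed_workers.pending--;
	pthread_cond_broadcast(&delayed_workers.idle);
	pthread_mutex_unlock(&delayed_workers.lock);
	return NULL;
}

static void queue_job(pthread_t *t)
{
	pthread_mutex_lock(&delayed_workers.lock);
	delayed_workers.pending++;
	pthread_mutex_unlock(&delayed_workers.lock);
	pthread_create(t, NULL, delayed_job, NULL);
}

/* Block until every queued job has finished. */
static void flush_wq(struct wq *wq)
{
	pthread_mutex_lock(&wq->lock);
	while (wq->pending > 0)
		pthread_cond_wait(&wq->idle, &wq->lock);
	pthread_mutex_unlock(&wq->lock);
}

static void close_ctree_sketch(void)
{
	flush_wq(&delayed_workers);	/* drain before the final commit */
	/* ... the final commit would pick up any transaction a job opened ... */
	pthread_mutex_lock(&delayed_workers.lock);
	roots_valid = false;		/* ... and only then free the roots */
	pthread_mutex_unlock(&delayed_workers.lock);
}
```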
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mtd: spinand: Do not erase the block before writing a bad block marker · d389050b
      Frieder Schrempf authored
      commit b645ad39 upstream.
      
      Currently when marking a block, we use spinand_erase_op() to erase
      the block before writing the marker to the OOB area. Doing so without
      waiting for the operation to finish can lead to the marking failing
      silently and no bad block marker being written to the flash.
      
      In fact we don't need to do an erase at all before writing the BBM.
      The ECC is disabled for raw accesses to the OOB data and we don't
      need to work around any issues with chips reporting ECC errors as it
      is known to be the case for raw NAND.
      
      Fixes: 7529df46 ("mtd: nand: Add core infrastructure to support SPI NANDs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de>
      Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
      Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
      Link: https://lore.kernel.org/linux-mtd/20200218100432.32433-4-frieder.schrempf@kontron.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mtd: spinand: Stop using spinand->oobbuf for buffering bad block markers · a8899631
      Frieder Schrempf authored
      commit 21489375 upstream.
      
      For reading and writing the bad block markers, spinand->oobbuf is
      currently used as a buffer for the marker bytes. During the
      underlying read and write operations to actually get/set the content
      of the OOB area, the content of spinand->oobbuf is reused and changed
      by accessing it through spinand->oobbuf and/or spinand->databuf.
      
      This is a flaw in the original design of the SPI NAND core and at the
      latest from 13c15e07 ("mtd: spinand: Handle the case where
      PROGRAM LOAD does not reset the cache") on, it results in not having
      the bad block marker written at all, as the spinand->oobbuf is
      cleared to 0xff after setting the marker bytes to zero.
      
      To fix it, we now just store the two bytes for the marker on the
      stack and let the read/write operations copy it from/to the page
      buffer later.
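      A user-space model of the aliasing problem, assuming (hypothetically) a write operation that first resets a shared 0xff-filled cache buffer and then copies the caller's bytes in. Staging the marker in that same buffer loses it; keeping it on the stack does not. All names and the OOB size are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE 64	/* hypothetical OOB area size */

/* Hypothetical shared buffer, analogous to spinand->oobbuf. */
static uint8_t oobbuf[OOB_SIZE];

/*
 * Hypothetical write op: reset the shared buffer to 0xff, copy the
 * caller's bytes in, then "program" the buffer into flash_oob.
 * memmove is used because 'src' may alias 'oobbuf'.
 */
static void write_oob(const uint8_t *src, size_t len, uint8_t *flash_oob)
{
	memset(oobbuf, 0xff, sizeof(oobbuf));
	memmove(oobbuf, src, len);
	memcpy(flash_oob, oobbuf, OOB_SIZE);
}

static void mark_bad_buggy(uint8_t *flash_oob)
{
	oobbuf[0] = oobbuf[1] = 0x00;		/* marker staged in the shared buffer */
	write_oob(oobbuf, 2, flash_oob);	/* ...and wiped by the 0xff reset */
}

static void mark_bad_fixed(uint8_t *flash_oob)
{
	const uint8_t marker[2] = { 0x00, 0x00 };	/* marker lives on the stack */

	write_oob(marker, sizeof(marker), flash_oob);
}
```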
      
      Fixes: 7529df46 ("mtd: nand: Add core infrastructure to support SPI NANDs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de>
      Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com>
      Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
      Link: https://lore.kernel.org/linux-mtd/20200218100432.32433-2-frieder.schrempf@kontron.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • CIFS: Fix bug which the return value by asynchronous read is error · 9bc02258
      Yilu Lin authored
      commit 97adda8b upstream.
      
      This patch fixes a bug in collect_uncached_read_data()
      where rc is implicitly converted from a signed to an
      unsigned number when a CIFS asynchronous read fails,
      which leaves ctx->rc with an incorrect value.
      
      Example:
      Share a directory and create a file on the Windows OS.
      Mount the directory to the Linux OS using CIFS.
      On the CIFS client of the Linux OS, invoke the pread interface to
      deliver the read request.
      
      The read offset plus the read length is greater
      than the maximum file size.
      
      In this case, the CIFS server on the Windows OS returns a failure
      message (for example, the return value of
      smb2.nt_status is STATUS_INVALID_PARAMETER).
      
      After receiving the response message, the CIFS client parses
      smb2.nt_status as STATUS_INVALID_PARAMETER
      and converts it to the Linux error code (rdata->result=-22).
      
      Then the CIFS client invokes the collect_uncached_read_data function to
      assign the value of rdata->result to rc, that is, rc=rdata->result=-22.
      
      The type of the ctx->total_len variable is unsigned integer,
      the type of the rc variable is integer, and the type of
      the ctx->rc variable is ssize_t.
      
      Therefore, during the ternary operation, the value of rc is
      automatically converted to an unsigned number. The final result is
      ctx->rc=4294967274. However, the expected result is ctx->rc=-22.
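      The conversion is easy to reproduce. The sketch below mirrors the ternary described above with the stated field types; `struct read_ctx` is a hypothetical subset of the real context, and casting `total_len` to `ssize_t` is one way (assumed here, not quoted from the patch) to keep the common type signed. The 4294967274 result assumes a 64-bit `ssize_t`.

```c
#include <sys/types.h>

/* Hypothetical subset of the read context, with the field types above. */
struct read_ctx {
	unsigned int total_len;
	ssize_t rc;
};

static void set_rc_buggy(struct read_ctx *ctx, int rc)
{
	/* unsigned int vs int: the usual arithmetic conversions yield
	 * unsigned int, so -22 becomes 4294967274 before the assignment. */
	ctx->rc = (rc == 0) ? ctx->total_len : rc;
}

static void set_rc_fixed(struct read_ctx *ctx, int rc)
{
	/* ssize_t vs int: the common type is signed, so -22 survives. */
	ctx->rc = (rc == 0) ? (ssize_t)ctx->total_len : rc;
}
```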

      Signed-off-by: Yilu Lin <linyilu@huawei.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: VMX: fix crash cleanup when KVM wasn't used · f9971a89
      Vitaly Kuznetsov authored
      commit dbef2808 upstream.
      
      If KVM wasn't used at all before we crash, the cleanup procedure fails with:
       BUG: unable to handle page fault for address: ffffffffffffffc8
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 23215067 P4D 23215067 PUD 23217067 PMD 0
       Oops: 0000 [#8] SMP PTI
       CPU: 0 PID: 3542 Comm: bash Kdump: loaded Tainted: G      D           5.6.0-rc2+ #823
       RIP: 0010:crash_vmclear_local_loaded_vmcss.cold+0x19/0x51 [kvm_intel]
      
      The root cause is that the loaded_vmcss_on_cpu list is not yet initialized:
      we initialize it in hardware_enable(), but this only happens when we start
      a VM.
      
      Previously, we used to have a bitmap with enabled CPUs and that was
      preventing [masking] the issue.
      
      Initialize the loaded_vmcss_on_cpu list earlier, right before we assign
      the crash_vmclear_loaded_vmcss pointer. The blocked_vcpu_on_cpu list and
      blocked_vcpu_on_cpu_lock are moved along with it for consistency.
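      The essence of the fix is "initialize before publishing". In the hypothetical sketch below, a zeroed list head has `next == NULL`, so a crash-time walk over it would dereference NULL; initializing every per-CPU head before setting the callback pointer keeps the walk safe. `NR_CPUS` and all names are illustrative, not the kvm_intel code.

```c
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical */

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Zero-initialized, like per-CPU data nobody has touched yet. */
static struct list_head loaded_vmcss_on_cpu[NR_CPUS];
static int (*crash_vmclear_cb)(int cpu);

/* Crash-time walk: would VMCLEAR each entry; here it just counts them. */
static int crash_walk(int cpu)
{
	int n = 0;

	for (struct list_head *p = loaded_vmcss_on_cpu[cpu].next;
	     p != &loaded_vmcss_on_cpu[cpu]; p = p->next)
		n++;
	return n;
}

static void module_init_sketch(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		init_list_head(&loaded_vmcss_on_cpu[cpu]);
	crash_vmclear_cb = crash_walk;	/* publish only once the lists are valid */
}
```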
      
      Fixes: 31603d4f ("KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200401081348.1345307-1-vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Gracefully handle __vmalloc() failure during VM allocation · 4538f42a
      Sean Christopherson authored
      commit d18b2f43 upstream.
      
      Check the result of __vmalloc() to avoid dereferencing a NULL pointer in
      the event that allocation fails.
      
      Fixes: d1e5b0e9 ("kvm: Make VM ioctl do valloc for some archs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: VMX: Always VMCLEAR in-use VMCSes during crash with kexec support · a9f890aa
      Sean Christopherson authored
      commit 31603d4f upstream.
      
      VMCLEAR all in-use VMCSes during a crash, even if kdump's NMI shootdown
      interrupted a KVM update of the percpu in-use VMCS list.
      
      Because NMIs are not blocked by disabling IRQs, it's possible that
      crash_vmclear_local_loaded_vmcss() could be called while the percpu list
      of VMCSes is being modified, e.g. in the middle of list_add() in
      vmx_vcpu_load_vmcs().  This potential corner case was called out in the
      original commit[*], but the analysis of its impact was wrong.
      
      Skipping the VMCLEARs is wrong because it all but guarantees that a
      loaded, and therefore cached, VMCS will live across kexec and corrupt
      memory in the new kernel.  Corruption will occur because the CPU's VMCS
      cache is non-coherent, i.e. not snooped, and so the writeback of VMCS
      memory on its eviction will overwrite random memory in the new kernel.
      The VMCS will live because the NMI shootdown also disables VMX, i.e. the
      in-progress VMCLEAR will #UD, and existing Intel CPUs do not flush the
      VMCS cache on VMXOFF.
      
      Furthermore, interrupting list_add() and list_del() is safe due to
      crash_vmclear_local_loaded_vmcss() using forward iteration.  list_add()
      ensures the new entry is not visible to forward iteration unless the
      entire add completes, via WRITE_ONCE(prev->next, new).  A bad "prev"
      pointer could be observed if the NMI shootdown interrupted list_del() or
      list_add(), but list_for_each_entry() does not consume ->prev.
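      The argument about forward iteration can be checked with a single-threaded simulation: split list_add() into its four stores in the kernel's order, stop after each prefix (as an NMI might), and walk the list forwards. This models program order only, not compiler or CPU reordering, which the kernel handles with WRITE_ONCE() and barriers; the types and helpers are hypothetical.

```c
#include <stddef.h>

struct node {
	struct node *next, *prev;
	int id;
};

/*
 * list_add(new, head) split into its four stores, executing only the
 * first 'steps' of them. The publishing store, prev->next = new, comes
 * last, so a forward walk sees 'new' either fully linked or not at all.
 */
static void list_add_partial(struct node *new, struct node *head, int steps)
{
	struct node *prev = head, *next = head->next;

	if (steps > 0)
		next->prev = new;
	if (steps > 1)
		new->next = next;
	if (steps > 2)
		new->prev = prev;
	if (steps > 3)
		prev->next = new;	/* WRITE_ONCE() in the kernel */
}

/* Forward-only walk, like list_for_each_entry(): never reads ->prev. */
static int forward_count(const struct node *head)
{
	int n = 0;

	for (const struct node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```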
      
      In addition to removing the temporary disabling of VMCLEAR, open code
      loaded_vmcs_init() in __loaded_vmcs_clear() and reorder VMCLEAR so that
      the VMCS is deleted from the list only after it's been VMCLEAR'd.
      Deleting the VMCS before VMCLEAR would allow a race where the NMI
      shootdown could arrive between list_del() and vmcs_clear() and thus
      neither flow would execute a successful VMCLEAR.  Alternatively, more
      code could be moved into loaded_vmcs_init(), but that gets rather silly
      as the only other user, alloc_loaded_vmcs(), doesn't need the smp_wmb()
      and would need to work around the list_del().
      
      Update the smp_*() comments related to the list manipulation, and
      opportunistically reword them to improve clarity.
      
      [*] https://patchwork.kernel.org/patch/1675731/#3720461
      
      Fixes: 8f536b76 ("KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200321193751.24985-2-sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Allocate new rmap and large page tracking when moving memslot · 4a0efabb
      Sean Christopherson authored
      commit edd4fa37 upstream.
      
      Reallocate the rmap array and recalculate large page compatibility when
      moving an existing memslot to correctly handle the alignment properties
      of the new memslot.  The number of rmap entries required at each level
      is dependent on the alignment of the memslot's base gfn with respect to
      that level, e.g. moving a large-page aligned memslot so that it becomes
      unaligned will increase the number of rmap entries needed at the now
      unaligned level.
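      A worked version of the alignment arithmetic, modeled on KVM's gfn_to_index() calculation and assuming x86's 512 frames per step up a page-table level. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed x86 layout: each level covers 512x the frames of the one below. */
static unsigned int hpage_shift(int level)
{
	return (unsigned int)(level - 1) * 9;
}

/*
 * Number of rmap slots needed at 'level' for a memslot: the count of
 * distinct level-sized frames touched by [base_gfn, base_gfn + npages).
 */
static uint64_t rmap_entries(uint64_t base_gfn, uint64_t npages, int level)
{
	unsigned int s = hpage_shift(level);

	return ((base_gfn + npages - 1) >> s) - (base_gfn >> s) + 1;
}
```

      Moving a 1024-page slot from gfn 0 to gfn 1 grows the level-2 (2M) count from 2 to 3, so reusing the old array would index one entry past its end.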
      
      Not updating the rmap array is the most obvious bug, as KVM accesses
      garbage data beyond the end of the rmap.  KVM interprets the bad data as
      pointers, leading to non-canonical #GPs, unexpected #PFs, etc...
      
        general protection fault: 0000 [#1] SMP
        CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:rmap_get_first+0x37/0x50 [kvm]
        Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3
        RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246
        RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012
        RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570
        RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8
        R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000
        R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8
        FS:  00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0
        Call Trace:
         kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm]
         __kvm_set_memory_region.part.64+0x559/0x960 [kvm]
         kvm_set_memory_region+0x45/0x60 [kvm]
         kvm_vm_ioctl+0x30f/0x920 [kvm]
         do_vfs_ioctl+0xa1/0x620
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4c/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f7ed9911f47
        Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
        RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47
        RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004
        RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700
        R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000
        R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000
        Modules linked in: kvm_intel kvm irqbypass
        ---[ end trace 0c5f570b3358ca89 ]---
      
      The disallow_lpage tracking is more subtle.  Failure to update results
      in KVM creating large pages when it shouldn't, either due to stale data
      or again due to indexing beyond the end of the metadata arrays, which
      can lead to memory corruption and/or leaking data to guest/userspace.
      
      Note, the arrays for the old memslot are freed by the unconditional call
      to kvm_free_memslot() in __kvm_set_memory_region().
      
      Fixes: 05da4558 ("KVM: MMU: large page support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David Hildenbrand's avatar
      KVM: s390: vsie: Fix delivery of addressing exceptions · de2ac8a7
      David Hildenbrand authored
      commit 4d4cee96 upstream.
      
      Whenever we get an -EFAULT, we failed to read in guest 2 physical
      address space. Such addressing exceptions are reported via a program
      intercept to the nested hypervisor.
      
      Because we faked the intercept, we have to return to guest 2. Right
      now, however, we would instead return -EFAULT from the intercept
      handler, eventually crashing the VM. The correct thing to do is to
      return 1, as rc == 1 is the internal representation of "we have to go
      back into g2".
      
      Addressing exceptions can only happen if the g2->g3 page tables
      reference invalid g2 addresses (say, either a table or the final page is
      not accessible), so this is something that basically never happens in
      sane environments.
      
      Identified by manual code inspection.
      
      Fixes: a3508fbe ("KVM: s390: vsie: initial support for nested virtualization")
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200403153050.20569-3-david@redhat.com
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      [borntraeger@de.ibm.com: fix patch description]
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      de2ac8a7
    • David Hildenbrand
      KVM: s390: vsie: Fix region 1 ASCE sanity shadow address checks · 50a59d2d
      David Hildenbrand authored
      commit a1d032a4 upstream.
      
      In case we have a region 1 ASCE, the following calculation
      (31 + ((gmap->asce & _ASCE_TYPE_MASK) >> 2)*11)
      results in 64. As shifts by the full width of the type (or more) are
      undefined, the compiler is free to use instructions like sllg, which
      only uses the low 6 bits of the shift amount (here 64), resulting in no
      shift at all. That means that ALL addresses will be rejected.
      
      This can result in endless loops, e.g. when the prefix cannot get mapped.
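
      A minimal user-space sketch of the range check described above. The
      constant values are assumed to mirror the s390 ASCE type encoding, and
      the helper names are hypothetical; it only illustrates why a region 1
      ASCE yields a shift count of 64 and one way to avoid the undefined
      shift by special-casing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror _ASCE_TYPE_MASK / _ASCE_TYPE_REGION1 (illustrative). */
#define ASCE_TYPE_MASK    0x0cULL
#define ASCE_TYPE_REGION1 0x0cULL

/* Address bits covered by the top-level table: 31, 42, 53 or 64. */
static unsigned int asce_addr_bits(uint64_t asce)
{
	return 31 + (unsigned int)((asce & ASCE_TYPE_MASK) >> 2) * 11;
}

/* True if raddr lies beyond what this ASCE can translate. */
static bool asce_addr_out_of_range(uint64_t asce, uint64_t raddr)
{
	/*
	 * A region 1 ASCE covers the whole 64-bit space. Returning early
	 * avoids "~0ULL << 64", which is undefined behavior in C.
	 */
	if ((asce & ASCE_TYPE_MASK) == ASCE_TYPE_REGION1)
		return false;
	return (raddr & (~0ULL << asce_addr_bits(asce))) != 0;
}
```

      For a segment table ASCE (type 0) the limit is 31 bits, so an address
      of 1 << 31 is rejected while (1 << 31) - 1 is accepted.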
      
      Fixes: 4be130a0 ("s390/mm: add shadow gmap support")
      Tested-by: Janosch Frank <frankja@linux.ibm.com>
      Reported-by: Janosch Frank <frankja@linux.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200403153050.20569-2-david@redhat.com
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      [borntraeger@de.ibm.com: fix patch description, remove WARN_ON_ONCE]
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      50a59d2d
    • Sean Christopherson
      KVM: nVMX: Properly handle userspace interrupt window request · deecbb36
      Sean Christopherson authored
      commit a1c77abb upstream.
      
      Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has
      external interrupt exiting enabled.  IRQs are never blocked in hardware
      if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger
      VM-Exit.
      
      The new check percolates up to kvm_vcpu_ready_for_interrupt_injection()
      and thus vcpu_run(), and so KVM will exit to userspace if userspace has
      requested an interrupt window (to inject an IRQ into L1).
      
      Remove the @external_intr param from vmx_check_nested_events(), which is
      actually an indicator that userspace wants an interrupt window, e.g.
      it's named @req_int_win further up the stack.  Injecting a VM-Exit into
      L1 to try and bounce out to L0 userspace is all kinds of broken and is
      no longer necessary.
      
      Remove the hack in nested_vmx_vmexit() that attempted to work around the
      breakage in vmx_check_nested_events() by only filling interrupt info if
      there's an actual interrupt pending.  The hack actually made things
      worse because it caused KVM to _never_ fill interrupt info when the
      LAPIC resides in userspace (kvm_cpu_has_interrupt() queries
      interrupt.injected, which is always cleared by prepare_vmcs12() before
      reaching the hack in nested_vmx_vmexit()).
      
      Fixes: 6550c4df ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"")
      Cc: stable@vger.kernel.org
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      deecbb36
    • Thomas Gleixner
      x86/entry/32: Add missing ASM_CLAC to general_protection entry · 7460d17c
      Thomas Gleixner authored
      commit 3d51507f upstream.
      
      All exception entry points must have ASM_CLAC right at the
      beginning. The general_protection entry is missing one.
      
      Fixes: e59d1b0a ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7460d17c
    • Eric W. Biederman
      signal: Extend exec_id to 64bits · a2a1be2d
      Eric W. Biederman authored
      commit d1e7fd64 upstream.
      
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care, an attacker can cause exec_id
      to wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited
      testing of a 32bit exec_id, it can take as little as 19s to exec 65536
      times, which means that it can take as little as 14 days to wrap a
      32bit exec_id.  Adam Zabrocki has succeeded in wrapping the
      self_exec_id in 7 days.  Even my slower timing is within the uptime of
      a typical server, which means self_exec_id is simply a speed bump
      today, and if exec gets noticeably faster, self_exec_id won't even be
      a speed bump.
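
      The 14 day figure follows from simple arithmetic: at 65536 (2^16) execs
      per 19s, a 32bit counter wraps after 2^16 * 19s = 1245184s, about 14.4
      days. A small sketch of that extrapolation, assuming the exec rate
      stays constant (the helper is hypothetical):

```c
#include <assert.h>

/*
 * Seconds needed to wrap a counter of counter_bits bits, extrapolated
 * from the measurement quoted above (65536 execs in 19 seconds) and
 * assuming the exec rate stays constant.
 */
static double seconds_to_wrap(unsigned int counter_bits)
{
	const double execs_per_second = 65536.0 / 19.0;
	double total_execs = 1.0;

	for (unsigned int i = 0; i < counter_bits; i++)
		total_execs *= 2.0;    /* total_execs = 2^counter_bits */

	return total_execs / execs_per_second;
}
```

      At the same rate a 64bit counter would need roughly 1.7 * 10^8 years.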
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures, where reading self_exec_id is no longer atomic and can
      take two read instructions.  This means that it is possible to hit
      a window where the value read for exec_id does not match the value
      written.  So with very lucky timing, this still remains exploitable
      after this change.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2a1be2d
    • Remi Pommarel
      ath9k: Handle txpower changes even when TPC is disabled · 19e119d4
      Remi Pommarel authored
      commit 968ae2ca upstream.
      
      When TPC is disabled, the IEEE80211_CONF_CHANGE_POWER event can be
      handled to reconfigure the HW's maximum txpower.

      This fixes a 0dBm txpower setting when the user attaches to an
      interface for the first time, in the following scenario:
      
      ieee80211_do_open()
          ath9k_add_interface()
              ath9k_set_txpower() /* Set TX power with not yet initialized
                                     sc->hw->conf.power_level */
      
          ieee80211_hw_config() /* Initialize sc->hw->conf.power_level and
                                   raise IEEE80211_CONF_CHANGE_POWER */
      
          ath9k_config() /* IEEE80211_CONF_CHANGE_POWER is ignored */
      
      This issue can be reproduced with the following:
      
        $ modprobe -r ath9k
        $ modprobe ath9k
        $ wpa_supplicant -i wlan0 -c /tmp/wpa.conf &
        $ iw dev /* Here TX power is either 0 or 3 depending on RF chain */
        $ killall wpa_supplicant
        $ iw dev /* TX power goes back to calibrated value and subsequent
                    calls will be fine */
      
      Fixes: 283dd119 ("ath9k: add per-vif TX power capability")
      Cc: stable@vger.kernel.org
      Signed-off-by: Remi Pommarel <repk@triplefau.lt>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      19e119d4
    • Gustavo A. R. Silva
      MIPS: OCTEON: irq: Fix potential NULL pointer dereference · cde7e660
      Gustavo A. R. Silva authored
      commit 792a402c upstream.
      
      There is a potential NULL pointer dereference in case kzalloc()
      fails and returns NULL.
      
      Fix this by adding a NULL check on *cd*.
      
      This bug was detected with the help of Coccinelle.
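
      A user-space sketch of the allocation pattern this fix enforces, with
      calloc() standing in for kzalloc(). The structure and function names
      are hypothetical and do not reproduce the OCTEON CIB code:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-line chip data the driver allocates. */
struct cib_chip_data {
	int line;
	int bit;
};

/* Allocate and fill chip data, checking the allocation before use. */
static int cib_setup_line(int line, int bit, struct cib_chip_data **out)
{
	struct cib_chip_data *cd = calloc(1, sizeof(*cd));

	/* Without this check, the assignments below would dereference NULL. */
	if (!cd)
		return -ENOMEM;

	cd->line = line;
	cd->bit = bit;
	*out = cd;
	return 0;
}
```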
      
      Fixes: 64b139f9 ("MIPS: OCTEON: irq: add CIB and other fixes")
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cde7e660
    • Huacai Chen
      MIPS/tlbex: Fix LDDIR usage in setup_pw() for Loongson-3 · 67dea3c7
      Huacai Chen authored
      commit d191aaff upstream.
      
      LDDIR/LDPTE are Loongson-3's acceleration for page table walking. If BD
      (Base Directory, the 4th page directory) is not enabled, then GDOffset
      is biased by BadVAddr[63:62]. So, if GDOffset (aka. BadVAddr[47:36] for
      Loongson-3) is big enough, "0b11(BadVAddr[63:62])|BadVAddr[47:36]|...."
      can reach far beyond pg_swapper_dir. This means that pg_swapper_dir may
      NOT be accessed correctly by LDDIR, so fix it by setting PWDirExt in
      CP0_PWCtl.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Pei Huang <huangpei@loongson.cn>
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      67dea3c7
    • Vasily Averin
      pstore: pstore_ftrace_seq_next should increase position index · 76b48e98
      Vasily Averin authored
      commit 6c871b73 upstream.
      
      In Aug 2018, NeilBrown noted in
      commit 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface"):
      "Some ->next functions do not increment *pos when they return NULL...
      Note that such ->next functions are buggy and should be fixed.
      A simple demonstration is
      
       dd if=/proc/swaps bs=1000 skip=1
      
      Choose any block size larger than the size of /proc/swaps. This will
      always show the whole last line of /proc/swaps"
      
      The /proc/swaps output was fixed recently; however, there are a lot of
      other affected files, and one of them is related to the pstore subsystem.
      
      If the .next function does not change the position index, the following
      .show call will repeat the output related to the current position index.
      
      There are at least 2 related problems:
      - read after lseek beyond end of file, described above by NeilBrown:
        "dd if=<AFFECTED_FILE> bs=1000 skip=1" will generate the whole last line
      - read after lseek into the middle of the last line will output the
        expected rest of the last line but then repeat the whole last line
        once again.
      
      If the .show() function generates multi-line output (as
      pstore_ftrace_seq_show() may do), the following bash script cycles endlessly:
      
       $ q=;while read -r r;do echo "$((++q)) $r";done < AFFECTED_FILE
      
      Unfortunately, I'm not familiar enough with the pstore subsystem and was
      unable to find an affected pstore-related file on my test node.
      
      Cc: stable@vger.kernel.org
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Link: https://lore.kernel.org/r/4e49830d-4c88-0171-ee24-1ee540028dad@virtuozzo.com
      [kees: with robustness tweak from Joel Fernandes <joelaf@google.com>]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76b48e98
    • Sungbo Eo
      irqchip/versatile-fpga: Apply clear-mask earlier · 977cab66
      Sungbo Eo authored
      commit 6a214a28 upstream.
      
      Clear its own IRQs before the parent IRQ gets enabled, so that the
      remaining IRQs do not accidentally interrupt the parent IRQ controller.

      This patch also fixes a reboot bug on the OX820 SoC, where the remaining
      rps-timer IRQ raises a GIC interrupt that is left pending. After that,
      the rps-timer IRQ is cleared during driver initialization, and there's
      no IRQ left in rps-irq when local_irq_enable() is called, which triggers
      the error message "unexpected IRQ trap".
      
      Fixes: bdd272cb ("irqchip: versatile FPGA: support cascaded interrupts from DT")
      Signed-off-by: Sungbo Eo <mans0n@gorani.run>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20200321133842.2408823-1-mans0n@gorani.run
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      977cab66
    • Yang Xu
      KEYS: reaching the keys quotas correctly · 14b96359
      Yang Xu authored
      commit 2e356101 upstream.
      
      Currently, when we add a new user key, the call trace is as follows:
      
      add_key()
        key_create_or_update()
          key_alloc()
          __key_instantiate_and_link
            generic_key_instantiate
              key_payload_reserve
                ......
      
      Since commit a08bf91c ("KEYS: allow reaching the keys quotas exactly"),
      we can reach max bytes/keys in key_alloc, but we forgot to remove this
      limit when we reserve space for the payload in key_payload_reserve. So
      we can only reach max keys, but not max bytes, when there is a delta
      between plen and type->def_datalen. Remove this limit when
      instantiating the key, so that we stay consistent with key_alloc.
      
      Also, fix the similar problem in keyctl_chown_key().
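
      The inconsistency comes down to whether a reservation may land exactly
      on the limit. A sketch assuming a simplified quota structure (the
      names are hypothetical, not the key_payload_reserve() code) that
      contrasts the two comparisons:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-user quota state, reduced to the byte counters. */
struct quota {
	unsigned int used_bytes;
	unsigned int max_bytes;
};

/* Rejects a reservation that would land exactly on the limit. */
static bool may_reserve_strict(const struct quota *q, unsigned int delta)
{
	return q->used_bytes + delta < q->max_bytes;
}

/* Allows reaching the limit exactly; only exceeding it is refused. */
static bool may_reserve_inclusive(const struct quota *q, unsigned int delta)
{
	return q->used_bytes + delta <= q->max_bytes;
}
```

      With 90 of 100 bytes used, a 10 byte reservation is refused by the
      strict check but accepted by the inclusive one.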
      
      Fixes: 0b77f5bf ("keys: make the keyring quotas controllable through /proc/sys")
      Fixes: a08bf91c ("KEYS: allow reaching the keys quotas exactly")
      Cc: stable@vger.kernel.org # 5.0.x
      Cc: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      14b96359
    • Vasily Averin
      tpm: tpm2_bios_measurements_next should increase position index · 64157692
      Vasily Averin authored
      commit f9bf8adb upstream.
      
      If the .next function does not change the position index,
      the following .show call will repeat the output related
      to the current position index.

      For /sys/kernel/security/tpm0/binary_bios_measurements:
      1) read after lseek beyond end of file generates the whole last line.
      2) read after lseek to the middle of the last line generates the
      expected end of the last line and, unexpectedly, the whole last line
      once again.
      
      Cc: stable@vger.kernel.org # 4.19.x
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      64157692
    • Vasily Averin
      tpm: tpm1_bios_measurements_next should increase position index · 1da36bed
      Vasily Averin authored
      commit d7a47b96 upstream.
      
      If the .next function does not change the position index,
      the following .show call will repeat the output related
      to the current position index.

      In case of /sys/kernel/security/tpm0/ascii_bios_measurements
      and binary_bios_measurements:
      1) read after lseek beyond end of file generates the whole last line.
      2) read after lseek to the middle of the last line generates the
      expected end of the last line and, unexpectedly, the whole last line
      once again.
      
      Cc: stable@vger.kernel.org # 4.19.x
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1da36bed
    • Matthew Garrett
      tpm: Don't make log failures fatal · 7c775e8e
      Matthew Garrett authored
      commit 805fa88e upstream.
      
      If a TPM is in a disabled state, it's reasonable for it to have an empty
      log. Bailing out of probe in this case means that the PPI interface
      isn't available, so there's no way to then enable the TPM from the OS.
      In general it seems reasonable to ignore log errors: they shouldn't
      interfere with any other TPM functionality.

      Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
      Cc: stable@vger.kernel.org # 4.19.x
      Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c775e8e
    • Kishon Vijay Abraham I
      PCI: endpoint: Fix for concurrent memory allocation in OB address region · 9ffaeee7
      Kishon Vijay Abraham I authored
      commit 04e046ca upstream.
      
      pci-epc-mem uses a bitmap to manage the Endpoint outbound (OB) address
      region. This address region will be shared by multiple endpoint
      functions (in the case of a multi-function endpoint) and it has to be
      protected from concurrent access to avoid updating an inconsistent state.
      
      Use a mutex to protect bitmap updates to prevent the memory
      allocation API from returning incorrect addresses.
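
      A user-space sketch of the locking scheme described above, assuming a
      64-page window tracked in a single bitmap word and a pthread mutex in
      place of the kernel mutex. The names are hypothetical and this is not
      the pci-epc-mem implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define OB_PAGES 64  /* hypothetical window size, one bit per page */

struct ob_mem {
	pthread_mutex_t lock;  /* serializes every bitmap search and update */
	uint64_t bitmap;       /* bit i set => page i allocated */
};

static uint64_t ob_mask(unsigned int count)
{
	return (count >= 64) ? ~0ULL : ((1ULL << count) - 1);
}

/* Reserve 'count' contiguous pages; returns the first page or -1. */
static int ob_alloc(struct ob_mem *mem, unsigned int count)
{
	int ret = -1;

	if (count == 0 || count > OB_PAGES)
		return -1;

	uint64_t mask = ob_mask(count);

	pthread_mutex_lock(&mem->lock);
	for (unsigned int start = 0; start + count <= OB_PAGES; start++) {
		if (!(mem->bitmap & (mask << start))) {
			mem->bitmap |= mask << start;  /* claim under the lock */
			ret = (int)start;
			break;
		}
	}
	pthread_mutex_unlock(&mem->lock);
	return ret;
}

/* Release pages previously returned by ob_alloc(). */
static void ob_free(struct ob_mem *mem, unsigned int start, unsigned int count)
{
	pthread_mutex_lock(&mem->lock);
	mem->bitmap &= ~(ob_mask(count) << start);
	pthread_mutex_unlock(&mem->lock);
}
```

      Holding the lock across both the search and the update is what
      matters: locking only the update would still let two callers find the
      same free range and return the same address.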

      Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ffaeee7
    • Sean V Kelley
      PCI: Add boot interrupt quirk mechanism for Xeon chipsets · d2345d12
      Sean V Kelley authored
      commit b88bf6c3 upstream.
      
      The following was observed by Kar Hin Ong with the RT patchset:
      
        Backtrace:
        irq 19: nobody cared (try booting with the "irqpoll" option)
        CPU: 0 PID: 3329 Comm: irq/34-nipalk Tainted:4.14.87-rt49 #1
        Hardware name: National Instruments NI PXIe-8880/NI PXIe-8880,
                 BIOS 2.1.5f1 01/09/2020
        Call Trace:
        <IRQ>
          ? dump_stack+0x46/0x5e
          ? __report_bad_irq+0x2e/0xb0
          ? note_interrupt+0x242/0x290
          ? nNIKAL100_memoryRead16+0x8/0x10 [nikal]
          ? handle_irq_event_percpu+0x55/0x70
          ? handle_irq_event+0x4f/0x80
          ? handle_fasteoi_irq+0x81/0x180
          ? handle_irq+0x1c/0x30
          ? do_IRQ+0x41/0xd0
          ? common_interrupt+0x84/0x84
        </IRQ>
        ...
        handlers:
        [<ffffffffb3297200>] irq_default_primary_handler threaded
        [<ffffffffb3669180>] usb_hcd_irq
        Disabling IRQ #19
      
      The problem is that this device triggers boot interrupts due to
      threaded interrupt handling and masking of the IO-APIC. These boot
      interrupts are then forwarded on to the legacy PCH's PIRQ lines,
      where there is no handler present for the device.

      Whenever a PCI device fires an interrupt (INTx) to Pin 20 of IOAPIC 2
      (GSI 44), the kernel receives two interrupts:
      
         1. Interrupt from Pin 20 of IOAPIC 2  -> Expected
         2. Interrupt from Pin 19 of IOAPIC 1  -> UNEXPECTED
      
      Quirks for disabling boot interrupts (preferred) or rerouting the
      handler exist but do not address these Xeon chipsets' mechanism:
      https://lore.kernel.org/lkml/12131949181903-git-send-email-sassmann@suse.de/
      
      Add a new mechanism via PCI config space for those chipsets supporting
      the CIPINTRC register's dis_intx_rout2ich bit.
      
      Link: https://lore.kernel.org/r/20200220192930.64820-2-sean.v.kelley@linux.intel.com
      Reported-by: Kar Hin Ong <kar.hin.ong@ni.com>
      Tested-by: Kar Hin Ong <kar.hin.ong@ni.com>
      Signed-off-by: Sean V Kelley <sean.v.kelley@linux.intel.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2345d12
    • Yicong Yang
      PCI/ASPM: Clear the correct bits when enabling L1 substates · a73afecb
      Yicong Yang authored
      commit 58a3862a upstream.
      
      In pcie_config_aspm_l1ss(), we cleared the wrong bits when enabling ASPM L1
      Substates.  Instead of the L1.x enable bits (PCI_L1SS_CTL1_L1SS_MASK, 0xf), we
      cleared the Link Activation Interrupt Enable bit (PCI_L1SS_CAP_L1_PM_SS,
      0x10).
      
      Clear the L1.x enable bits before writing the new L1.x configuration.
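
      The read-modify-write being fixed can be sketched as follows, using the
      mask values quoted above. The helper is hypothetical and is not
      pcie_config_aspm_l1ss() itself:

```c
#include <assert.h>
#include <stdint.h>

/* Mask values as quoted in the message above. */
#define PCI_L1SS_CTL1_L1SS_MASK 0x0000000fu  /* L1.x enable bits */
#define PCI_L1SS_CAP_L1_PM_SS   0x00000010u  /* mask wrongly cleared before */

/* Clear 'clear_mask' in L1SS Control 1, then set the requested L1.x bits. */
static uint32_t l1ss_ctl1_update(uint32_t ctl1, uint32_t clear_mask,
				 uint32_t enable)
{
	ctl1 &= ~clear_mask;
	ctl1 |= enable & PCI_L1SS_CTL1_L1SS_MASK;
	return ctl1;
}
```

      Starting from CTL1 = 0x1f and requesting only 0x2, clearing 0xf yields
      0x12, whereas clearing 0x10 yields 0x0f: the stale L1.x enables stay
      set and bit 4 is lost.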
      
      [bhelgaas: changelog]
      Fixes: aeda9ade ("PCI/ASPM: Configure L1 substate settings")
      Link: https://lore.kernel.org/r/1584093227-1292-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: stable@vger.kernel.org	# v4.11+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a73afecb
    • Lukas Wunner
      PCI: pciehp: Fix indefinite wait on sysfs requests · 1ada617e
      Lukas Wunner authored
      commit 3e487d2e upstream.
      
      David Hoyer reports that powering pciehp slots up or down via sysfs may
      hang:  The call to wait_event() in pciehp_sysfs_enable_slot() and
      _disable_slot() does not return because ctrl->ist_running remains true.
      
      This flag, which was introduced by commit 157c1062 ("PCI: pciehp: Avoid
      returning prematurely from sysfs requests"), signifies that the IRQ thread
      pciehp_ist() is running.  It is set to true at the top of pciehp_ist() and
      reset to false at the end.  However, there are two additional return
      statements in pciehp_ist() before which the commit neglected to reset the
      flag to false and wake up waiters for the flag.
      
      That omission opens up the following race when powering up the slot:
      
      * pciehp_ist() runs because a PCI_EXP_SLTSTA_PDC event was requested
        by pciehp_sysfs_enable_slot()
      
      * pciehp_ist() turns on slot power via the following call stack:
        pciehp_handle_presence_or_link_change() -> pciehp_enable_slot() ->
        __pciehp_enable_slot() -> board_added() -> pciehp_power_on_slot()
      
      * after slot power is turned on, the link comes up, resulting in a
        PCI_EXP_SLTSTA_DLLSC event
      
      * the IRQ handler pciehp_isr() stores the event in ctrl->pending_events
        and returns IRQ_WAKE_THREAD
      
      * the IRQ thread is already woken (it's bringing up the slot), but the
        genirq code remembers to re-run the IRQ thread after it has finished
        (such that it can deal with the new event) by setting IRQTF_RUNTHREAD
        via __handle_irq_event_percpu() -> __irq_wake_thread()
      
      * the IRQ thread removes PCI_EXP_SLTSTA_DLLSC from ctrl->pending_events
        via board_added() -> pciehp_check_link_status() in order to deal with
        presence and link flaps per commit 6c35a1ac ("PCI: pciehp:
        Tolerate initially unstable link")
      
      * after pciehp_ist() has successfully brought up the slot, it resets
        ctrl->ist_running to false and wakes up the sysfs requester
      
      * the genirq code re-runs pciehp_ist(), which sets ctrl->ist_running
        to true but then returns with IRQ_NONE because ctrl->pending_events
        is empty
      
      * pciehp_sysfs_enable_slot() is finally woken but notices that
        ctrl->ist_running is true, hence continues waiting
      
      The only way to get the hung task going again is to trigger a hotplug
      event which brings down the slot, e.g. by yanking out the card.
      
      The same race exists when powering down the slot because remove_board()
      likewise clears link or presence changes in ctrl->pending_events per commit
      3943af9d ("PCI: pciehp: Ignore Link State Changes after powering off a
      slot") and thereby may cause a re-run of pciehp_ist() which returns with
      IRQ_NONE without resetting ctrl->ist_running to false.
      
      Fix by adding a goto label before the teardown steps at the end of
      pciehp_ist() and jumping to that label from the two return statements which
      currently neglect to reset the ctrl->ist_running flag.
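
      A reduced sketch of the single-exit pattern the fix applies to
      pciehp_ist(). The structure, field and function names are hypothetical
      stand-ins, and the actual event handling is elided:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical controller state with only the fields the pattern needs. */
struct ctrl {
	bool ist_running;
	unsigned int pending_events;
	int wakeups;  /* counts wake-ups of the sysfs requester */
};

enum irqreturn { IRQ_NONE, IRQ_HANDLED };

static void wake_requester(struct ctrl *c)
{
	c->wakeups++;  /* stand-in for waking the sysfs waiter */
}

/*
 * Every exit funnels through 'out', so ist_running is always reset and
 * the requester is always woken, including on the early IRQ_NONE path
 * that the race above depends on.
 */
static enum irqreturn ist(struct ctrl *c)
{
	enum irqreturn ret = IRQ_NONE;

	c->ist_running = true;

	if (!c->pending_events)
		goto out;  /* a bare "return IRQ_NONE" would leave the flag set */

	/* ... handle the pending events ... */
	c->pending_events = 0;
	ret = IRQ_HANDLED;
out:
	c->ist_running = false;
	wake_requester(c);
	return ret;
}
```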
      
      Fixes: 157c1062 ("PCI: pciehp: Avoid returning prematurely from sysfs requests")
      Link: https://lore.kernel.org/r/cca1effa488065cb055120aa01b65719094bdcb5.1584530321.git.lukas@wunner.de
      Reported-by: David Hoyer <David.Hoyer@netapp.com>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org	# v4.19+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ada617e
    • James Smart
      nvme: Treat discovery subsystems as unique subsystems · 011529b7
      James Smart authored
      commit c26aa572 upstream.
      
      Current code matches subnqn and collapses all controllers with the
      same subnqn into a single subsystem structure. This is good for
      recognizing multiple controllers for the same subsystem. But with
      the well-known discovery subnqn, the subsystems aren't truly the
      same subsystem. As such, subsystem specific rules, such as no
      overlap of controller id, do not apply. With today's behavior, the
      check for overlap of controller id can fail, preventing the new
      discovery controller from being created.
      
      When searching for a matching subsystem NQN, exclude the discovery NQN
      from matching. This will result in each discovery controller being
      attached to a unique subsystem structure.
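
      A sketch of the matching rule, assuming a plain string comparison on
      NQNs. The helper name is hypothetical; the string is the well-known
      NVMe discovery subsystem NQN:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Well-known discovery subsystem NQN. */
#define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"

/*
 * May a new controller reuse an existing subsystem structure? Only if the
 * subnqn matches and is not the discovery NQN; every discovery controller
 * gets its own subsystem.
 */
static bool subsys_may_share(const char *existing_nqn, const char *new_nqn)
{
	if (strcmp(new_nqn, NVME_DISC_SUBSYS_NAME) == 0)
		return false;
	return strcmp(existing_nqn, new_nqn) == 0;
}
```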

      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      011529b7
    • James Smart
      nvme-fc: Revert "add module to ops template to allow module references" · 287ea8b4
      James Smart authored
      commit 8c5c6605 upstream.
      
      The original patch was meant to keep the lldd from being unloaded
      while it is being used to talk to the boot device of the system.
      However, the end result of the original patch is that any driver
      unload while an nvme controller is live via the lldd is now
      prohibited. Given the module reference, the module teardown routine
      can't be called, so there's no way to terminate the controllers
      other than manual actions.
      
      Fixes: 863fbae9 ("nvme_fc: add module to ops template to allow module references")
      Cc: <stable@vger.kernel.org> # v5.4+
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      287ea8b4
    • Martin Blumenstingl
      thermal: devfreq_cooling: inline all stubs for CONFIG_DEVFREQ_THERMAL=n · 46cc8837
      Martin Blumenstingl authored
      commit 3f5b9959 upstream.
      
      When CONFIG_DEVFREQ_THERMAL is disabled, all functions except
      of_devfreq_cooling_register_power() were already inlined. Also inline
      the last function to avoid compile errors when multiple drivers call
      of_devfreq_cooling_register_power() when CONFIG_DEVFREQ_THERMAL is not
      set. Compilation failed with the following message:
        multiple definition of `of_devfreq_cooling_register_power'
      (which then lists all usages of of_devfreq_cooling_register_power())
      
      Thomas Zimmermann reported this problem [0] on a kernel config with
      CONFIG_DRM_LIMA={m,y}, CONFIG_DRM_PANFROST={m,y} and
      CONFIG_DEVFREQ_THERMAL=n after both the lima and panfrost drivers
      gained devfreq cooling support.
      
      [0] https://www.spinics.net/lists/dri-devel/msg252825.html
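
      A compilable sketch of the header pattern, with hypothetical names
      standing in for the devfreq cooling API and minimal stand-ins for
      ERR_PTR()/PTR_ERR(). With the option disabled, the stub is static
      inline, so each translation unit that includes the header gets its own
      private copy and the linker sees no duplicate symbol:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types and error-pointer helpers. */
struct device;
struct cooling_dev;
#define ERR_PTR(err) ((void *)(long)(err))
#define PTR_ERR(ptr) ((long)(ptr))

#ifdef CONFIG_EXAMPLE_THERMAL
struct cooling_dev *example_cooling_register(struct device *dev);
#else
/*
 * A non-inline definition here would be emitted by every .c file that
 * includes the header, and linking two such objects would fail with
 * "multiple definition of `example_cooling_register'".
 */
static inline struct cooling_dev *example_cooling_register(struct device *dev)
{
	(void)dev;
	return ERR_PTR(-EINVAL);
}
#endif
```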
      
      Fixes: a76caf55 ("thermal: Add devfreq cooling")
      Cc: stable@vger.kernel.org
      Reported-by: Thomas Zimmermann <tzimmermann@suse.de>
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Tested-by: Thomas Zimmermann <tzimmermann@suse.de>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Link: https://lore.kernel.org/r/20200403205133.1101808-1-martin.blumenstingl@googlemail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jan Engelhardt
      acpi/x86: ignore unspecified bit positions in the ACPI global lock field · d56a8ea4
      Jan Engelhardt authored
      commit ecb9c790 upstream.
      
      The value in "new" is constructed from "old" such that all bits defined
      as reserved by the ACPI spec[1] are left untouched. But if those bits
      do not happen to be all zero, "new < 3" will not evaluate to true.
      
      The firmware of the laptop(s) Medion MD63490 / Akoya P15648 comes with
      garbage inside the "FACS" ACPI table. The starting value is
      old=0x4944454d, therefore new=0x4944454e, which is >= 3. Mask off
      the reserved bits.
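      A sketch of the acquire sequence using C11 atomics in place of the
      kernel's cmpxchg(). The bit layout (bit 0 pending, bit 1 owned, bits
      2..31 reserved) follows the ACPI FACS global lock definition; the
      function and macro names are hypothetical, and the arithmetic is an
      assumed reconstruction of the acquire logic, not the kernel source.
      The point it illustrates is that the ownership test must look only at
      the two defined bits.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define GL_PENDING      0x1u /* bit 0: another agent is waiting */
#define GL_OWNED        0x2u /* bit 1: the lock is owned */
#define GL_DEFINED_BITS (GL_PENDING | GL_OWNED) /* bits 2..31 are reserved */

/* Returns true if the lock was acquired, false if it was marked pending. */
static bool gl_try_acquire(_Atomic unsigned int *lock)
{
    unsigned int old = atomic_load(lock);
    unsigned int new;

    do {
        /* Keep reserved bits as-is, set OWNED, and set PENDING only if
         * the lock was already owned. On CAS failure "old" is reloaded. */
        new = (old & ~GL_DEFINED_BITS) + GL_OWNED + ((old >> 1) & 0x1u);
    } while (!atomic_compare_exchange_weak(lock, &old, new));

    /* Masked test: reserved-bit garbage cannot affect the result. */
    return (new & GL_DEFINED_BITS) < GL_DEFINED_BITS;
}
```

      With old=0x4944454d, new is 0x4944454e and (new & 0x3) is 2, so the
      lock is reported as acquired; the unmasked test "new < 3" would have
      reported it as contended forever.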
      
      [1] https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206553
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Jan Engelhardt <jengelh@inai.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Benoit Parrot
      media: ti-vpe: cal: fix disable_irqs to only the intended target · 811a3f83
      Benoit Parrot authored
      commit 1db56284 upstream.
      
      disable_irqs() was mistakenly disabling all interrupts when called.
      This caused all port streams to stop even when only one of them was
      being stopped.
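      A toy model of the difference, assuming (hypothetically) one interrupt
      enable bit per port in a shared write-1-to-clear register. The
      register, the bit layout, and all function names here are invented
      for illustration and do not describe the CAL hardware.

```c
#include <stdint.h>

/* Hypothetical layout: one interrupt-enable bit per port. */
#define PORT_IRQ_MASK(port) (1u << (port))

/* Simulated interrupt-enable state shared by all ports. */
static uint32_t irq_enabled;

/* Simulated write-1-to-clear register: bits written as 1 are disabled,
 * bits written as 0 are left untouched. */
static void irqenable_clr_write(uint32_t val)
{
    irq_enabled &= ~val;
}

static void enable_irqs(unsigned int port)
{
    irq_enabled |= PORT_IRQ_MASK(port);
}

/* Buggy behavior: writing all ones disables every port's interrupts. */
static void disable_irqs_all(unsigned int port)
{
    (void)port;
    irqenable_clr_write(0xFFFFFFFFu);
}

/* Fixed behavior: only the stopping port's bit is cleared. */
static void disable_irqs(unsigned int port)
{
    irqenable_clr_write(PORT_IRQ_MASK(port));
}
```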
      
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Benoit Parrot <bparrot@ti.com>
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Takashi Iwai
      ALSA: hda/realtek - Add quirk for MSI GL63 · 2c3dab1b
      Takashi Iwai authored
      commit 1d3aa4a5 upstream.
      
      The MSI GL63 laptop requires the same quirk as other MSI models,
      ALC1220_FIXUP_CLEVO_P950.  The board BIOS doesn't provide a PCI SSID
      for the device, hence we need to take the codec SSID (1462:1275)
      instead.
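      A userspace sketch of an SSID-keyed quirk lookup that falls back to
      the codec SSID when the PCI SSID is unset. The table layout, the enum
      names, the pick_fixup() helper, and the lookup order are assumptions
      for illustration; the only values taken from the commit are the
      1462:1275 codec SSID and the fixup it selects. SSIDs are packed as
      vendor << 16 | device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup identifiers. */
enum fixup { FIXUP_NONE = 0, FIXUP_CLEVO_P950 };

struct ssid_quirk {
    uint16_t subvendor;
    uint16_t subdevice;
    const char *name;
    enum fixup fixup;
};

static const struct ssid_quirk quirks[] = {
    { 0x1462, 0x1275, "MSI-GL63", FIXUP_CLEVO_P950 },
    { 0, 0, NULL, FIXUP_NONE } /* terminator */
};

/* Try the PCI SSID first, then the codec SSID; 0 means "not provided". */
static enum fixup pick_fixup(uint32_t pci_ssid, uint32_t codec_ssid)
{
    const uint32_t ids[2] = { pci_ssid, codec_ssid };

    for (size_t i = 0; i < 2; i++) {
        if (ids[i] == 0)
            continue; /* BIOS left this SSID unset */
        for (const struct ssid_quirk *q = quirks; q->name; q++)
            if (q->subvendor == (ids[i] >> 16) &&
                q->subdevice == (ids[i] & 0xffffu))
                return q->fixup;
    }
    return FIXUP_NONE;
}
```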
      
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=207157
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20200408135645.21896-1-tiwai@suse.de
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Hebb
      ALSA: hda/realtek - Remove now-unnecessary XPS 13 headphone noise fixups · e71c369b
      Thomas Hebb authored
      commit f36938aa upstream.
      
      patch_realtek.c has historically failed to properly configure the PC
      Beep Hidden Register for the ALC256 codec (among others). Depending on
      your kernel version, symptoms of this misconfiguration can range from
      chassis noise (picked up by a poorly shielded PCBEEP trace) being
      amplified and played on your internal speaker and/or headphones, to
      loud feedback, which responds to the "Headphone Mic Boost" ALSA
      control, being played through your headphones. For details of the
      problem, see the patch in this series titled "ALSA: hda/realtek - Set
      principled PC Beep configuration for ALC256", which fixes the
      configuration.
      
      These symptoms have been most noticed on the Dell XPS 13 9350 and 9360,
      popular laptops that use the ALC256. As a result, several model-specific
      fixups have been introduced to try and fix the problem, the most
      egregious of which locks the "Headphone Mic Boost" control as a hack to
      minimize noise from a feedback loop that shouldn't have been there in
      the first place.
      
      Now that the underlying issue has been fixed, remove all these fixups.
      Remaining fixups needed by the XPS 13 are all picked up by existing pin
      quirks.
      
      For the XPS 13 9350/9360, this change should:
      
       - Significantly increase volume and audio quality on headphones
       - Eliminate headphone popping on suspend/resume
       - Allow "Headphone Mic Boost" to be set again, making the headphone
         jack fully usable as a microphone jack too.
      
      Fixes: 8c69729b ("ALSA: hda - Fix headphone noise after Dell XPS 13 resume back from S3")
      Fixes: 423cd785 ("ALSA: hda - Fix headphone noise on Dell XPS 13 9360")
      Fixes: e4c9fd10 ("ALSA: hda - Apply headphone noise quirk for another Dell XPS 13 variant")
      Fixes: 1099f484 ("ALSA: hda/realtek: Reduce the Headphone static noise on XPS 9350/9360")
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
      Link: https://lore.kernel.org/r/b649a00edfde150cf6eebbb4390e15e0c2deb39a.1585584498.git.tommyhebb@gmail.com
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Hebb
      ALSA: hda/realtek - Set principled PC Beep configuration for ALC256 · 92b27256
      Thomas Hebb authored
      commit c4473744 upstream.
      
      The Realtek PC Beep Hidden Register[1] is currently set by
      patch_realtek.c in two different places:
      
      In alc_fill_eapd_coef(), it's set to the value 0x5757, corresponding to
      non-beep input on 1Ah and no 1Ah loopback to either headphones or
      speakers. (Although, curiously, the loopback amp is still enabled.) This
      write was added fairly recently by commit e3743f431143 ("ALSA:
      hda/realtek - Dell headphone has noise on unmute for ALC236") and is a
      safe default. However, it happens in the wrong place:
      alc_fill_eapd_coef() runs on module load and cold boot but not on S3
      resume, meaning the register loses its value after suspend.
      
      Conversely, in alc256_init(), the register is updated to unset bit 13
      (disable speaker loopback) and set bit 5 (set non-beep input on 1Ah).
      Although this write does run on S3 resume, it's not quite enough to fix
      up the register's default value of 0x3717. What's missing is setting
      bit 14 to disable headphone loopback. Without that, we end up with a
      feedback loop where the headphone jack is being driven by amplified
      samples of itself[2].
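      The bit arithmetic in the paragraph above can be checked with a small
      sketch. update_coef() is a hypothetical helper performing the usual
      read-modify-write (clear the bits in mask, then OR in set); the bit
      meanings are taken from this commit message, not from Realtek
      documentation.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Read-modify-write: clear the bits in mask, then set the bits in set. */
static uint16_t update_coef(uint16_t val, uint16_t mask, uint16_t set)
{
    return (uint16_t)((val & ~mask) | set);
}

/* Per the commit text: bit 13 clear disables speaker loopback,
 * bit 14 set disables headphone loopback. */
static bool speaker_loopback_disabled(uint16_t v)
{
    return (v & BIT(13)) == 0;
}

static bool headphone_loopback_disabled(uint16_t v)
{
    return (v & BIT(14)) != 0;
}
```

      Applied to the default 0x3717, clearing bit 13 and setting bit 5 gives
      0x1737, which still leaves bit 14 clear, so headphone loopback stays
      enabled. The value 0x5757 has bit 13 clear and bit 14 set, disabling
      both loopback paths.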
      
      This change eliminates the update in alc256_init() and replaces it with
      the 0x5757 write from alc_fill_eapd_coef(). Kailang says that 0x5757 is
      supposed to be the codec's default value, so using it will make
      debugging easier for Realtek.
      
      Affects the ALC255, ALC256, ALC257, ALC235, and ALC236 codecs.
      
      [1] Newly documented in Documentation/sound/hd-audio/realtek-pc-beep.rst
      
      [2] Setting the "Headphone Mic Boost" control from userspace changes
      this feedback loop and has been a widely-shared workaround for headphone
      noise on laptops like the Dell XPS 13 9350. This commit eliminates the
      feedback loop and makes the workaround unnecessary.
      
      Fixes: e1e8c1fd ("ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236")
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
      Link: https://lore.kernel.org/r/bf22b417d1f2474b12011c2a39ed6cf8b06d3bf5.1585584498.git.tommyhebb@gmail.com
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Thomas Hebb
      ALSA: doc: Document PC Beep Hidden Register on Realtek ALC256 · 7cb3c198
      Thomas Hebb authored
      commit f1280904 upstream.
      
      This codec (among others) has a hidden set of audio routes, apparently
      designed to allow PC Beep output without a mixer widget on the output
      path, which are controlled by an undocumented Realtek vendor register.
      The default configuration of these routes means that certain inputs
      aren't accessible, necessitating driver control of the register.
      However, Realtek has provided no documentation of the register, instead
      opting to fix issues by providing magic numbers, most of which have been
      at least somewhat erroneous. These magic numbers then get copied by
      others into model-specific fixups, leading to a fragmented and buggy set
      of configurations.
      
      To get out of this situation, I've reverse engineered the register by
      flipping bits and observing how the codec's behavior changes. This
      commit documents my findings. It does not change any code.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
      Link: https://lore.kernel.org/r/bd69dfdeaf40ff31c4b7b797c829bb320031739c.1585584498.git.tommyhebb@gmail.com
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>