1. 04 Jun, 2021 2 commits
  2. 03 Jun, 2021 4 commits
    • x86/setup: Always reserve the first 1M of RAM · f1d4d47c
      Mike Rapoport authored
      There are BIOSes that are known to corrupt the memory under 1M, or more
      precisely under 640K because the memory above 640K is anyway reserved
      for the EGA/VGA frame buffer and BIOS.
      
      To keep the kernel from using memory that may be clobbered, the
      beginning of the memory is always reserved. The exact size of the
      reserved area is determined by the CONFIG_X86_RESERVE_LOW build-time
      option and the "reservelow=" command line option. The reserved range may
      be from 4K to 640K with the default of 64K. There are also configurations
      that reserve the entire 1M range, like machines with SandyBridge graphic
      devices or systems that enable crash kernel.
      
      In addition to the potentially clobbered memory, EBDA of unknown size may
      be as low as 128K and the memory above that EBDA start is also reserved
      early.
      
      It would have been possible to reserve the entire range under 1M were it
      not for the real mode trampoline, which must reside in that area.
      
      To accommodate placement of the real mode trampoline and keep the memory
      safe from being clobbered by BIOS, reserve the first 64K of RAM before
      memory allocations are possible and then, after the real mode trampoline
      is allocated, reserve the entire range from 0 to 1M.
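
      The two-stage ordering described above can be illustrated with a toy
      model. This is only a sketch under stated assumptions: the page-granular
      map and the helpers reserve_range(), alloc_low() and setup_low_memory()
      are hypothetical stand-ins, not the kernel's memblock API, and the EBDA
      handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE  4096UL
#define SZ_64K         (64UL * 1024)
#define SZ_1M          (1024UL * 1024)
#define LOW_PAGES      (SZ_1M / TOY_PAGE_SIZE)

/* Toy reservation map for the first 1M, one flag per 4K page (hypothetical). */
static bool reserved[LOW_PAGES];

/* Mark [start, end) as reserved; a partially covered last page counts too. */
static void reserve_range(unsigned long start, unsigned long end)
{
	assert(start <= end && end <= SZ_1M);
	for (unsigned long p = start / TOY_PAGE_SIZE; p * TOY_PAGE_SIZE < end; p++)
		reserved[p] = true;
}

/* First-fit allocation below 1M; returns the address or -1 on failure. */
static long alloc_low(unsigned long size)
{
	unsigned long need = (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	unsigned long run = 0;

	for (unsigned long p = 0; p < LOW_PAGES; p++) {
		run = reserved[p] ? 0 : run + 1;
		if (need && run == need) {
			unsigned long first = p + 1 - need;

			for (unsigned long q = first; q <= p; q++)
				reserved[q] = true;
			return (long)(first * TOY_PAGE_SIZE);
		}
	}
	return -1;
}

/* Stage 1: reserve 64K. Stage 2: place the trampoline. Stage 3: reserve 0..1M. */
static long setup_low_memory(unsigned long trampoline_size)
{
	long trampoline;

	memset(reserved, 0, sizeof(reserved));
	reserve_range(0, SZ_64K);
	trampoline = alloc_low(trampoline_size);
	reserve_range(0, SZ_1M);
	return trampoline;
}
```

      In this model the trampoline lands above the first 64K, and any later
      allocation attempt below 1M fails because the whole range is reserved
      by then.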
      
      Update trim_snb_memory() and reserve_real_mode() to avoid redundant
      reservations of the same memory range.
      
      Also make sure the memory under 1M is not getting freed by
      efi_free_boot_services().
      
       [ bp: Massage commit message and comments. ]
      
      Fixes: a799c2bd ("x86/setup: Consolidate early memory reservations")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Hugh Dickins <hughd@google.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213177
      Link: https://lkml.kernel.org/r/20210601075354.5149-2-rppt@kernel.org
    • x86/alternative: Optimize single-byte NOPs at an arbitrary position · 2b31e8ed
      Borislav Petkov authored
      Up until now the assumption was that an alternative patching site would
      have some instructions at the beginning and trailing single-byte NOPs
      (0x90) padding. Therefore, the patching machinery would go and optimize
      those single-byte NOPs into longer ones.
      
      However, this assumption is broken on 32-bit when code like
      hv_do_hypercall() in hyperv_init() would use the retpoline speculation
      killer CALL_NOSPEC. The 32-bit version of that macro would align certain
      insns to 16 bytes, leading to the compiler issuing one or more
      single-byte NOPs, depending on the holes it needs to fill for alignment.
      
      That would lead to the warning in optimize_nops() to fire:
      
        ------------[ cut here ]------------
        Not a NOP at 0xc27fb598
         WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:211 optimize_nops.isra.13
      
      due to that function verifying whether all of the following bytes really
      are single-byte NOPs.
      
      Therefore, carve out the NOP padding into a separate function and call
      it for each NOP range beginning with a single-byte NOP.
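
      The idea of handling every NOP range, not only the trailing one, can be
      sketched as follows. This is an illustration, not the kernel's
      optimize_nops(): the struct nop_run type and find_nop_runs() are
      hypothetical, and the instruction lengths are assumed to be supplied by
      an instruction decoder so that a 0x90 byte inside another instruction is
      not mistaken for a NOP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous range of single-byte NOPs inside a patch site. */
struct nop_run {
	size_t start;	/* byte offset of the first 0x90 */
	size_t len;	/* number of consecutive single-byte NOPs */
};

/*
 * Walk the site instruction by instruction. insn_len[i] is the length of
 * the i-th instruction (assumed to come from a decoder). Every range that
 * begins with a single-byte NOP is recorded, wherever it sits in the site.
 * Returns the number of ranges found; at most max_out are stored in out[].
 */
static size_t find_nop_runs(const uint8_t *buf, const uint8_t *insn_len,
			    size_t ninsns, struct nop_run *out, size_t max_out)
{
	size_t off = 0, i = 0, nruns = 0;

	while (i < ninsns) {
		assert(insn_len[i] > 0);

		if (insn_len[i] == 1 && buf[off] == 0x90) {
			size_t start = off;

			/* Extend the range while single-byte NOPs follow. */
			while (i < ninsns && insn_len[i] == 1 && buf[off] == 0x90) {
				off++;
				i++;
			}
			if (nruns < max_out) {
				out[nruns].start = start;
				out[nruns].len = off - start;
			}
			nruns++;
		} else {
			/* Not a NOP: skip the whole instruction. */
			off += insn_len[i];
			i++;
		}
	}
	return nruns;
}
```

      A real implementation would then rewrite each recorded range with
      longer NOP encodings; the sketch stops at locating the ranges.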
      
      Fixes: 23c1ad53 ("x86/alternatives: Optimize optimize_nops()")
      Reported-by: Richard Narron <richard@aaazen.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213301
      Link: https://lkml.kernel.org/r/20210601212125.17145-1-bp@alien8.de
    • x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid() · 9bfecd05
      Thomas Gleixner authored
      While digesting the XSAVE-related horrors which got introduced with
      the supervisor/user split, the recent addition of ENQCMD-related
      functionality got on the radar and turned out to be similarly broken.
      
      update_pasid(), which is only required when X86_FEATURE_ENQCMD is
      available, is invoked from two places:
      
       1) From switch_to() for the incoming task
      
       2) Via an SMP function call from the IOMMU/SVM code
      
      #1 is half-ways correct as it hacks around the brokenness of get_xsave_addr()
         by enforcing the state to be 'present', but all the conditionals in that
         code are completely pointless for that.
      
         Also the invocation is just useless overhead because at that point
         it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
         and all of this can be handled at return to user space.
      
      #2 is broken beyond repair. The comment in the code claims that it is safe
         to invoke this in an IPI, but that's just wishful thinking.
      
         FPU state of a running task is protected by fpregs_lock() which is
         nothing else than a local_bh_disable(). As BH-disabled regions usually
         run with interrupts enabled, the IPI can hit a code section which
         modifies FPU state and there is absolutely no guarantee that any of the
         assumptions which are made for the IPI case is true.
      
         Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
         invoked with a NULL pointer argument, so it can hit a completely
         unrelated task and unconditionally force an update for nothing.
         Worse, it can hit a kernel thread which operates on a user space
         address space and set a random PASID for it.
      
      The offending commit does not cleanly revert, but it's sufficient to
      force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
      code to make this dysfunctional all over the place. Anything more
      complex would require more surgery and none of the related functions
      outside of the x86 core code are blatantly wrong, so removing those
      would be overkill.
      
      As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
      required to make this actually work, this cannot result in a regression
      except for related out of tree train-wrecks, but they are broken already
      today.
      
      Fixes: 20f0afd1 ("x86/mmu: Allocate/free a PASID")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
    • dmaengine: idxd: Use cpu_feature_enabled() · 74b2fc88
      Borislav Petkov authored
      When testing x86 feature bits, use cpu_feature_enabled() so that
      build-disabled features can remain off, regardless of what CPUID says.
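
      The difference between a raw capability test and a test that honors
      build-time disabling can be modeled in a few lines. This is a
      simplified sketch: the feature numbers, BUILD_DISABLED_MASK,
      cpu_has_raw() and feature_enabled() are hypothetical names for
      illustration and do not reproduce the kernel's macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit numbers, for illustration only. */
enum {
	TOY_FEATURE_A = 0,
	TOY_FEATURE_B = 1,	/* pretend this one is disabled at build time */
};

/* Features compiled out of this build (hypothetical mask). */
#define BUILD_DISABLED_MASK	(1u << TOY_FEATURE_B)

/* Raw test: reports whatever the hardware enumeration says. */
static bool cpu_has_raw(uint32_t caps, unsigned int bit)
{
	assert(bit < 32);
	return (caps >> bit) & 1u;
}

/* Build-aware test: a build-disabled feature is off no matter what. */
static bool feature_enabled(uint32_t caps, unsigned int bit)
{
	assert(bit < 32);
	if (BUILD_DISABLED_MASK & (1u << bit))
		return false;
	return cpu_has_raw(caps, bit);
}
```

      With both bits set in caps, the raw test reports TOY_FEATURE_B as
      present while the build-aware test keeps it off, which is the behavior
      the commit message asks for.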
      
      Fixes: 8e50d392 ("dmaengine: idxd: Add shared workqueue support")
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Vinod Koul <vkoul@kernel.org>
      Cc: <stable@vger.kernel.org>
  3. 31 May, 2021 1 commit
    • x86/thermal: Fix LVT thermal setup for SMI delivery mode · 9a90ed06
      Borislav Petkov authored
      There are machines out there with added value crap^WBIOS which provide an
      SMI handler for the local APIC thermal sensor interrupt. Out of reset,
      the BSP on those machines has something like 0x200 in that APIC register
      (timestamps left in because this whole issue is timing sensitive):
      
        [    0.033858] read lvtthmr: 0x330, val: 0x200
      
      which means:
      
       - bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
       - bits [10:8] have 010b which means SMI delivery mode.
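
      The decoding of that register value can be checked with a small
      sketch. The macro and function names below are local to this example
      and are not the kernel's definitions; only the bit positions (mask at
      bit 16, delivery mode in bits [10:8], 010b for SMI) come from the
      description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LVT_MASK_BIT	(1u << 16)	/* set: interrupt is masked */
#define LVT_DM_SHIFT	8		/* delivery mode lives in bits [10:8] */
#define LVT_DM_FIELD	0x7u
#define LVT_DM_SMI	0x2u		/* 010b: SMI delivery mode */

/* True if the LVT entry has its interrupt mask bit set. */
static bool lvt_is_masked(uint32_t lvt)
{
	return (lvt & LVT_MASK_BIT) != 0;
}

/* Extract the 3-bit delivery mode field. */
static unsigned int lvt_delivery_mode(uint32_t lvt)
{
	return (lvt >> LVT_DM_SHIFT) & LVT_DM_FIELD;
}

/* Self-check for the value quoted above: enabled, SMI delivery. */
static void lvt_example(void)
{
	assert(!lvt_is_masked(0x200));
	assert(lvt_delivery_mode(0x200) == LVT_DM_SMI);
}
```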
      
      Now, later during boot, when the kernel programs the local APIC, it
      soft-disables it temporarily through the spurious vector register:
      
        setup_local_APIC:
      
        	...
      
      	/*
      	 * If this comes from kexec/kcrash the APIC might be enabled in
      	 * SPIV. Soft disable it before doing further initialization.
      	 */
      	value = apic_read(APIC_SPIV);
      	value &= ~APIC_SPIV_APIC_ENABLED;
      	apic_write(APIC_SPIV, value);
      
      which means (from the SDM):
      
      "10.4.7.2 Local APIC State After It Has Been Software Disabled
      
      ...
      
      * The mask bits for all the LVT entries are set. Attempts to reset these
      bits will be ignored."
      
      And this happens too:
      
        [    0.124111] APIC: Switch to symmetric I/O mode setup
        [    0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
        [    0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
      
      This results in CPU 0 soft lockups depending on the placement in time
      when the APIC soft-disable happens. Those soft lockups are not 100%
      reproducible and the reason for that can only be speculated as no one
      tells you what SMM does. Likely, it confuses the SMM code that the APIC
      is disabled and the thermal interrupt doesn't fire at all,
      leading to CPU 0 stuck in SMM forever...
      
      Now, before
      
        4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      
      due to how the APIC_LVTTHMR was read before APIC initialization in
      mcheck_intel_therm_init(), it would read the value with the mask bit 16
      clear and then intel_init_thermal() would replicate it onto the APs and
      all would be peachy - the thermal interrupt would remain enabled.
      
      But that commit moved that reading to a later moment in
      intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
      late and with its interrupt mask bit set.
      
      Thus, revert back to the old behavior of reading the thermal LVT
      register before the APIC gets initialized.
      
      Fixes: 4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      Reported-by: James Feeney <james@nurealm.net>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
  4. 29 May, 2021 1 commit
    • x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing · 7d65f9e8
      Thomas Gleixner authored
      PIC interrupts do not support affinity setting and they can end up on
      any online CPU. Therefore, it's required to mark the associated vectors
      as system-wide reserved. Otherwise, the corresponding irq descriptors
      are copied to the secondary CPUs but the vectors are not marked as
      assigned or reserved. This works correctly for the IO/APIC case.
      
      When the IO/APIC is disabled via config, kernel command line or lack of
      enumeration then all legacy interrupts are routed through the PIC, but
      nothing marks them as system-wide reserved vectors.
      
      As a consequence, a subsequent allocation on a secondary CPU can result in
      allocating one of these vectors, which triggers the BUG() in
      apic_update_vector() because the interrupt descriptor slot is not empty.
      
      Imran tried to work around that by marking those interrupts as allocated
      when a CPU comes online. But that's wrong in case that the IO/APIC is
      available and one of the legacy interrupts, e.g. IRQ0, has been switched to
      PIC mode because then marking them as allocated will fail as they are
      already marked as system vectors.
      
      Stay consistent and update the legacy vectors after attempting IO/APIC
      initialization and mark them as system vectors in case that no IO/APIC is
      available.
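
      Why unmarked legacy vectors are a problem can be shown with a toy
      allocator. This is a sketch only: struct vector_map and both helpers
      are hypothetical, the model covers a single CPU, and the vector
      numbers (first allocatable vector 0x20, legacy IRQs at 0x30..0x3f)
      are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VECTORS	256
#define FIRST_EXTERNAL	0x20	/* assumed first allocatable vector */
#define LEGACY_BASE	0x30	/* assumed vector of legacy IRQ 0 */
#define NR_LEGACY_IRQS	16

/* Toy per-CPU vector state (hypothetical). */
struct vector_map {
	bool system[NR_VECTORS];	/* system-wide reserved, never handed out */
	bool used[NR_VECTORS];		/* already allocated on this CPU */
};

/* Reserve all legacy PIC vectors system-wide. */
static void mark_legacy_as_system(struct vector_map *m)
{
	assert(m != NULL);
	for (int irq = 0; irq < NR_LEGACY_IRQS; irq++)
		m->system[LEGACY_BASE + irq] = true;
}

/* First-fit vector allocation that skips system and used vectors. */
static int alloc_vector(struct vector_map *m)
{
	assert(m != NULL);
	for (int v = FIRST_EXTERNAL; v < NR_VECTORS; v++) {
		if (!m->system[v] && !m->used[v]) {
			m->used[v] = true;
			return v;
		}
	}
	return -1;
}
```

      Without the marking step, the allocator in this model eventually hands
      out 0x30, a vector that a PIC interrupt already occupies; with it, the
      allocator skips the whole legacy range.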
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: Imran Khan <imran.f.khan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
  5. 23 May, 2021 18 commits
  6. 22 May, 2021 4 commits
    • Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block · 4ff2473b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - Fix BLKRRPART and deletion race (Gulam, Christoph)
      
       - NVMe pull request (Christoph):
            - nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith
              Busch)
            - nvme-fc teardown fix (James Smart)
            - nvmet/nvme-loop memory leak fixes (Wu Bo)
      
      * tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
        block: fix a race between del_gendisk and BLKRRPART
        block: prevent block device lookups at the beginning of del_gendisk
        nvme-fc: clear q_live at beginning of association teardown
        nvme-tcp: rerun io_work if req_list is not empty
        nvme-tcp: fix possible use-after-completion
        nvme-loop: fix memory leak in nvme_loop_create_ctrl()
        nvmet: fix memory leak in nvmet_alloc_ctrl()
    • Merge tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block · b9231dfb
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "One fix for a regression with poll in this merge window, and another
        just hardens the io-wq exit path a bit"
      
      * tag 'io_uring-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
        io_uring: fortify tctx/io_wq cleanup
        io_uring: don't modify req->poll for rw
    • Merge tag 'for-linus-5.13b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 23d72926
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - a fix for a boot regression when running as PV guest on hardware
         without NX support
      
       - a small series fixing a bug in the Xen pciback driver when
         configuring a PCI card with multiple virtual functions
      
      * tag 'for-linus-5.13b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-pciback: reconfigure also from backend watch handler
        xen-pciback: redo VF placement in the virtual topology
        x86/Xen: swap NX determination and GDT setup on BSP
    • Merge tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · a3969ef4
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
      
       - Fix some math errors in the realtime allocator when extent size hints
         are applied.
      
       - Fix unnecessary short writes to realtime files when free space is
         fragmented.
      
       - Fix a crash when using scrub tracepoints.
      
       - Restore ioctl uapi definitions that were accidentally removed in
         5.13-rc1.
      
      * tag 'xfs-5.13-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: restore old ioctl definitions
        xfs: fix deadlock retry tracepoint arguments
        xfs: retry allocations when locality-based search fails
        xfs: adjust rt allocation minlen when extszhint > rtextsize
  7. 21 May, 2021 10 commits