1. 05 May, 2024 1 commit
  2. 02 May, 2024 1 commit
    • Merge branch 'shared-zeropage' into features · 22a49f6d
      Alexander Gordeev authored
      David Hildenbrand says:
      
      ===================
      This series fixes one issue with uffd + shared zeropages on s390x and
      ensures that "ordinary" KVM guests can make use of shared zeropages again.

      userfaultfd could currently end up mapping shared zeropages into processes
      that forbid shared zeropages. This only applies to s390x, relevant for
      handling PV guests and guests that use storage keys correctly. Fix it
      by placing a zeroed folio instead of the shared zeropage during
      UFFDIO_ZEROPAGE.
      
      I stumbled over this issue while looking into a customer scenario that
      is using:
      
      (1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
          and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
          available and additional memory can be "fake hotplugged" to the VM
          later on demand by deflating the balloon. Actual memory overcommit is
          not desired, so physical memory would only be moved between VMs.
      
      (2) Live migration of VMs between sites to evacuate servers in case of
          emergency.
      
      Without the shared zeropage, during (2), the VM would suddenly consume
      100 GiB on the migration source and destination. On the migration source,
      where we don't expect memory overcommit, we could easily end up crashing
      the VM during migration.
      
      Independent of that, memory handed back to the hypervisor using "free page
      reporting" would end up consuming actual memory after the migration on the
      destination, not getting freed up until reused+freed again.
      
      While there might be ways to optimize parts of this in QEMU, we really
      should just support the shared zeropage again for ordinary VMs.
      
      We only expect legacy guests to make use of storage keys, so let's handle
      zeropages again when enabling storage keys or when enabling PV. To not
      break userfaultfd like we did in the past, don't zap the shared zeropages,
      but instead trigger unsharing faults, just like we do for unsharing
      KSM pages in break_ksm().
      
      Unsharing faults will simply replace the shared zeropage by a zeroed
      anonymous folio. We can already trigger the same fault path using GUP,
      when trying to long-term pin a shared zeropage, but also when unmerging
      KSM-placed zeropages, so this is nothing new.
      
      Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
      running the uffd selftests.
      
      Patch #2 tested on s390x: the live migration scenario now works as
      expected, and kvm-unit-tests that trigger usage of skeys work well, whereby
      I can see detection and unsharing of shared zeropages.
      
      Further (as broken in v2), I tested that the shared zeropage is no
      longer populated after skeys are used -- that mm_forbids_zeropage() works
      as expected:
        ./s390x-run s390x/skey.elf \
         -no-shutdown \
         -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
         -mon chardev=monitor,mode=readline
      
        Then, in another shell:
      
        # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
        Rss:               31484 kB
        #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
        ...
        # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
        Rss:              160452 kB
      
        -> Reading guest memory does not populate the shared zeropage
      
        Doing the same with selftest.elf (no skeys)
      
        # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
        Rss:               30900 kB
        #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
        ...
        # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
        Rss:               30924 kB
      
        -> Reading guest memory does populate the shared zeropage
      ===================
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
  3. 01 May, 2024 2 commits
    • KVM: s390: vsie: Use virt_to_phys for crypto control block · cc4edb92
      Nina Schoetterl-Glausch authored
      The address of the crypto control block in the (shadow) SIE block is
      absolute/physical.
      Convert from virtual to physical when shadowing the guest's control
      block during VSIE.
      Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
      Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390: Relocate vmlinux ELF data to virtual address space · 9ecaa2e9
      Alexander Gordeev authored
      Currently kernel image relocation tables and other ELF
      data are set to base zero. Since kernel virtual and
      physical address spaces are uncoupled the kernel is
      mapped at the top of the virtual address space, hence
      making the information contained in vmlinux ELF tables
      inconsistent.
      
      That does not pose any issue with regard to the kernel
      booting and operation, but makes it difficult to use a
      generated vmlinux with some debugging tools (e.g. gdb).
      
      Relocate the vmlinux image base address from zero to a base
      address in the virtual address space. It is the address that
      the kernel is mapped to in case KASLR is disabled.
      
      The vmlinux ELF header before and after this change looks
      like this:
      
      Elf file type is EXEC (Executable file)
      Entry point 0x100000
      There are 3 program headers, starting at offset 64
      
      Program Headers:
        Type           Offset             VirtAddr           PhysAddr
                       FileSiz            MemSiz              Flags  Align
        LOAD           0x0000000000001000 0x0000000000100000 0x0000000000100000
                       0x0000000001323378 0x0000000001323378  R E    0x1000
        LOAD           0x0000000001325000 0x0000000001424000 0x0000000001424000
                       0x00000000003a4200 0x000000000048fdb8  RWE    0x1000
        NOTE           0x00000000012a33b0 0x00000000013a23b0 0x00000000013a23b0
                       0x0000000000000054 0x0000000000000054         0x4
      
      Elf file type is EXEC (Executable file)
      Entry point 0x3ffe0000000
      There are 3 program headers, starting at offset 64
      
      Program Headers:
        Type           Offset             VirtAddr           PhysAddr
                       FileSiz            MemSiz              Flags  Align
        LOAD           0x0000000000001000 0x000003ffe0000000 0x000003ffe0000000
                       0x0000000001323378 0x0000000001323378  R E    0x1000
        LOAD           0x0000000001325000 0x000003ffe1324000 0x000003ffe1324000
                       0x00000000003a4200 0x000000000048fdb8  RWE    0x1000
        NOTE           0x00000000012a33b0 0x000003ffe12a23b0 0x000003ffe12a23b0
                       0x0000000000000054 0x0000000000000054         0x4
      Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
  4. 29 Apr, 2024 6 commits
  5. 22 Apr, 2024 5 commits
  6. 18 Apr, 2024 2 commits
    • s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests · 06201e00
      David Hildenbrand authored
      commit fa41ba0d ("s390/mm: avoid empty zero pages for KVM guests to
      avoid postcopy hangs") introduced an undesired side effect when combined
      with memory ballooning and VM migration: memory part of the inflated
      memory balloon will consume memory.
      
      Assuming we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
      will consume ~60GiB of memory. If we now trigger a VM migration,
      hypervisors like QEMU will read all VM memory. As s390x does not support
      the shared zeropage, we'll end up allocating for all previously-inflated
      memory part of the memory balloon: 40 GiB. So we might easily
      (unexpectedly) crash the VM on the migration source.
      
      Even worse, hypervisors like QEMU optimize for zeropage migration to not
      consume memory on the migration destination: when migrating a
      "page full of zeroes", on the migration destination they check whether the
      target memory is already zero (by reading the destination memory) and avoid
      writing to the memory to not allocate memory. However, s390x will also
      allocate memory here, implying that on the migration destination, too, we
      will end up allocating all previously-inflated memory part of the memory
      balloon.
      
      This is especially bad if actual memory overcommit was not desired, when
      memory ballooning is used for dynamic VM memory resizing, setting aside
      some memory during boot that can be added later on demand. Alternatives
      like virtio-mem that would avoid this issue are not yet available on
      s390x.
      
      There could be ways to optimize some cases in user space: before reading
      memory in an anonymous private mapping on the migration source, check via
      /proc/self/pagemap if anything is already populated. Similarly check on
      the migration destination before reading. While that would avoid
      populating tables full of shared zeropages on all architectures, it's
      harder to get right and performant, and requires user space changes.
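
      For illustration, a minimal user space sketch of such a pagemap check
      (not part of this series; names and structure are ours): a page counts
      as populated if bit 63 (present) or bit 62 (swapped) is set in its
      pagemap entry.

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* Return 1 if the page backing addr is populated (present in RAM
         * or in swap), 0 if not, -1 on error. */
        static int page_is_populated(int pagemap_fd, void *addr, long psize)
        {
                uint64_t entry;
                off_t off = ((uintptr_t)addr / psize) * sizeof(entry);

                if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
                        return -1;
                return !!(entry & (3ULL << 62));
        }

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                int fd = open("/proc/self/pagemap", O_RDONLY);
                char *buf = mmap(NULL, psize, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                printf("before touch: %d\n", page_is_populated(fd, buf, psize));
                buf[0] = 1; /* fault the page in */
                printf("after touch:  %d\n", page_is_populated(fd, buf, psize));
                close(fd);
                return 0;
        }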
      
      Further, with postcopy live migration we must place a page, so there,
      "avoid touching memory to avoid allocating memory" is not really
      possible. (Note that previously we would have falsely inserted
      shared zeropages into processes using UFFDIO_ZEROPAGE where
      mm_forbids_zeropage() would have actually forbidden it.)
      
      PV is currently incompatible with memory ballooning, and in the common
      case, KVM guests don't make use of storage keys. Instead of zapping
      zeropages when enabling storage keys / PV, which turned out to be
      problematic in the past, let's do exactly what we do with KSM pages:
      trigger unsharing faults to replace the shared zeropages by proper
      anonymous folios.
      
      What about added latency when enabling storage keys? Having a lot of
      zeropages in applicable environments (PV, legacy guests, unittests) is
      unexpected. Further, KSM could today already unshare the zeropages,
      and unmerging KSM pages when enabling storage keys would unshare the
      KSM-placed zeropages in the same way, resulting in the same latency.
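
      The unsharing mechanism itself can be pictured with a short sketch
      modeled on break_ksm(); the helper name is ours, and the real patch
      additionally has to find the zeropage mappings first:

        /* Hypothetical helper (caller holds the mmap lock): trigger an
         * unsharing fault so the shared zeropage mapped at addr is
         * replaced by a freshly allocated, zeroed anonymous folio. */
        static int unshare_zeropage(struct vm_area_struct *vma,
                                    unsigned long addr)
        {
                vm_fault_t ret;

                ret = handle_mm_fault(vma, addr, FAULT_FLAG_UNSHARE, NULL);
                return (ret & VM_FAULT_ERROR) ? -EFAULT : 0;
        }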
      
      [ agordeev: Fixed sparse and checkpatch complaints and error handling ]
      Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Fixes: fa41ba0d ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • mm/userfaultfd: Do not place zeropages when zeropages are disallowed · 90a7592d
      David Hildenbrand authored
      s390x must disable shared zeropages for processes running VMs, because
      the VMs could end up making use of "storage keys" or protected
      virtualization, which are incompatible with shared zeropages.
      
      Yet, with userfaultfd it is possible to insert shared zeropages into
      such processes. Let's fall back to simply allocating a fresh zeroed
      anonymous folio and inserting that instead.
      
      mm_forbids_zeropage() was introduced in commit 593befa6 ("mm: introduce
      mm_forbids_zeropage function"), shortly before userfaultfd went
      upstream.
      
      Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do
      for hugetlb, it would be rather unexpected. Further, we also
      cannot really indicate "not supported" to user space ahead of time: it
      could be that the MM disallows zeropages after userfaultfd was already
      registered.
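
      Condensed, the shape of the change (not the verbatim patch; the
      fallback helper name mfill_atomic_pte_zeroed_folio() is how we refer
      to the new path here):

        /* At the top of the UFFDIO_ZEROPAGE handler: fall back to a
         * freshly zeroed anonymous folio when the MM forbids the shared
         * zeropage (e.g. s390x processes running VMs): */
        if (mm_forbids_zeropage(dst_vma->vm_mm))
                return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);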
      
      [ agordeev: Fixed checkpatch complaints ]
      
      Fixes: c1a4de99 ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Link: https://lore.kernel.org/r/20240411161441.910170-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
  7. 17 Apr, 2024 20 commits
    • s390/expoline: Make modules use kernel expolines · ba05b39d
      Vasily Gorbik authored
      Currently, kernel modules contain their own set of expoline thunks. In
      the case of EXPOLINE_EXTERN, this involves postlinking of precompiled
      expoline.o. expoline.o is also necessary for out-of-source tree module
      builds.
      
      Now that the kernel modules area is less than 4 GB away from
      kernel expoline thunks, make modules use kernel expolines. Also make
      EXPOLINE_EXTERN the default if the compiler supports it. This simplifies
      the build and aligns with the approach adopted by other architectures.
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/nospec: Correct modules thunk offset calculation · ea84f14d
      Vasily Gorbik authored
      Fix the offset calculation when the branch target is more than 2 GB away.
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Do not rescue .vmlinux.relocs section · 236d70f8
      Alexander Gordeev authored
      The .vmlinux.relocs section is moved in front of the compressed
      kernel. The interim section rescue step is avoided as a result.
      Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Rework deployment of the kernel image · 56b1069c
      Alexander Gordeev authored
      Rework deployment of the kernel image for both compressed and
      uncompressed variants, as defined by the CONFIG_KERNEL_UNCOMPRESSED
      kernel configuration variable.

      In case CONFIG_KERNEL_UNCOMPRESSED is disabled, avoid uncompressing
      the kernel to a temporary buffer and copying it to the target
      address. Instead, uncompress it directly to the target destination.

      In case CONFIG_KERNEL_UNCOMPRESSED is enabled, avoid moving the
      kernel to the default 0x100000 location when KASLR is disabled or
      has failed. Instead, use the uncompressed kernel image directly.
      
      In case KASLR is disabled or has failed, the .amode31 section location
      in memory is not randomized and precedes the kernel image. In case
      CONFIG_KERNEL_UNCOMPRESSED is disabled, that location overlaps the
      area used by the decompression algorithm. That is fine, since that
      area is not used after the decompression has finished and the size of
      the .amode31 section is not expected to ever exceed BOOT_HEAP_SIZE.
      
      There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is
      enabled. Therefore, rename decompress_kernel() to deploy_kernel(),
      which better describes both uncompressed and compressed cases.
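
      A hedged sketch of the resulting control flow; apart from
      deploy_kernel() and CONFIG_KERNEL_UNCOMPRESSED, the names below are
      illustrative only:

        static void deploy_kernel(void *output)
        {
                if (IS_ENABLED(CONFIG_KERNEL_UNCOMPRESSED)) {
                        /* No decompression: use the image where it was
                         * loaded; copy only if KASLR picked another spot. */
                        if (output != image_load_addr)
                                memmove(output, image_load_addr, image_size);
                } else {
                        /* Decompress straight to the final destination,
                         * without a temporary buffer and a second copy. */
                        decompress_to(output);
                }
        }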
      
      Introduce the AMODE31_SIZE macro to avoid the immediate value of
      0x3000 (the size of the .amode31 section) in the decompressor linker
      script. Modify the vmlinux linker script to force the size of the
      .amode31 section to AMODE31_SIZE (the value of (_eamode31 - _samode31)
      could otherwise differ as a result of the compiler options used).
      Introduce the __START_KERNEL macro that defines the kernel ELF image
      entry point and set it to the current value of 0x100000.
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390: Map kernel at fixed location when KASLR is disabled · 54f2ecc3
      Alexander Gordeev authored
      Since kernel virtual and physical address spaces are
      uncoupled, the kernel is mapped at the top of the virtual
      address space in case KASLR is disabled.

      That does not pose any issue with regard to the kernel
      booting and operation, but makes it difficult to use a
      generated vmlinux with some debugging tools (e.g. gdb),
      because the exact location of the kernel image in virtual
      memory is unknown. Make that location known and introduce
      the CONFIG_KERNEL_IMAGE_BASE configuration option.
      
      A custom CONFIG_KERNEL_IMAGE_BASE value that would break
      the virtual memory layout leads to a build error.
      
      The kernel image size is defined by KERNEL_IMAGE_SIZE
      macro and set to 512 MB, by analogy with x86.
      Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/mm: Uncouple physical vs virtual address spaces · c98d2eca
      Alexander Gordeev authored
      Uncoupling the physical and virtual address spaces brings
      the following benefits to s390:

      - virtual memory layout flexibility;
      - closes the address gap between kernel and modules, which
        caused s390-only problems in the past (e.g. 'perf' bugs);
      - allows getting rid of trampolines used for module calls
        into the kernel;
      - allows simplifying the BPF trampoline;
      - minor performance improvement in branch prediction;
      - kernel randomization entropy is an order of magnitude bigger,
        as it is derived from the amount of available virtual, not
        physical memory;
      
      The whole change can be described by the two pictures below:
      before and after the change.

      Some aspects of the virtual memory layout setup are not
      clarified (number of page levels, alignment, DMA memory),
      since these are not part of this change or are secondary
      with regard to how the uncoupling itself is implemented.

      The focus of the pictures is to explain why the __va() and __pa()
      macros are implemented the way they are.
      
              Memory layout in V==R mode:
      
      |    Physical      |    Virtual       |
      +- 0 --------------+- 0 --------------+ identity mapping start
      |                  | S390_lowcore     | Low-address memory
      |                  +- 8 KB -----------+
      |                  |                  |
      |                  | identity         | phys == virt
      |                  | mapping          | virt == phys
      |                  |                  |
      +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
      |.amode31 text/data|.amode31 text/data|
      +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end
      |                  |                  |
      |                  |                  |
      +- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start
      |                  |                  |
      | kernel text/data | kernel text/data | phys == kvirt
      |                  |                  |
      +------------------+------------------+ kernel phys/virt end
      |                  |                  |
      |                  |                  |
      |                  |                  |
      |                  |                  |
      +- ident_map_size -+- ident_map_size -+ identity mapping end
                         |                  |
                         |  ... unused gap  |
                         |                  |
                         +---- vmemmap -----+ 'struct page' array start
                         |                  |
                         | virtually mapped |
                         | memory map       |
                         |                  |
                         +- __abs_lowcore --+
                         |                  |
                         | Absolute Lowcore |
                         |                  |
                         +- __memcpy_real_area
                         |                  |
                         |  Real Memory Copy|
                         |                  |
                         +- VMALLOC_START --+ vmalloc area start
                         |                  |
                         |  vmalloc area    |
                         |                  |
                         +- MODULES_VADDR --+ modules area start
                         |                  |
                         |  modules area    |
                         |                  |
                         +------------------+ UltraVisor Secure Storage limit
                         |                  |
                         |  ... unused gap  |
                         |                  |
                         +KASAN_SHADOW_START+ KASAN shadow memory start
                         |                  |
                         |   KASAN shadow   |
                         |                  |
                         +------------------+ ASCE limit
      
              Memory layout in V!=R mode:
      
      |    Physical      |    Virtual       |
      +- 0 --------------+- 0 --------------+
      |                  | S390_lowcore     | Low-address memory
      |                  +- 8 KB -----------+
      |                  |                  |
      |                  |                  |
      |                  | ... unused gap   |
      |                  |                  |
      +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
      |.amode31 text/data|.amode31 text/data|
      +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
      |                  |                  |
      |                  |                  |
      +- __kaslr_offset_phys		     | kernel rand. phys start
      |                  |                  |
      | kernel text/data |                  |
      |                  |                  |
      +------------------+		     | kernel phys end
      |                  |                  |
      |                  |                  |
      |                  |                  |
      |                  |                  |
      +- ident_map_size -+		     |
                         |                  |
                         |  ... unused gap  |
                         |                  |
                         +- __identity_base + identity mapping start (>= 2GB)
                         |                  |
                         | identity         | phys == virt - __identity_base
                         | mapping          | virt == phys + __identity_base
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         |                  |
                         +---- vmemmap -----+ 'struct page' array start
                         |                  |
                         | virtually mapped |
                         | memory map       |
                         |                  |
                         +- __abs_lowcore --+
                         |                  |
                         | Absolute Lowcore |
                         |                  |
                         +- __memcpy_real_area
                         |                  |
                         |  Real Memory Copy|
                         |                  |
                         +- VMALLOC_START --+ vmalloc area start
                         |                  |
                         |  vmalloc area    |
                         |                  |
                         +- MODULES_VADDR --+ modules area start
                         |                  |
                         |  modules area    |
                         |                  |
                         +- __kaslr_offset -+ kernel rand. virt start
                         |                  |
                         | kernel text/data | phys == (kvirt - __kaslr_offset) +
                         |                  |         __kaslr_offset_phys
                         +- kernel .bss end + kernel rand. virt end
                         |                  |
                         |  ... unused gap  |
                         |                  |
                         +------------------+ UltraVisor Secure Storage limit
                         |                  |
                         |  ... unused gap  |
                         |                  |
                         +KASAN_SHADOW_START+ KASAN shadow memory start
                         |                  |
                         |   KASAN shadow   |
                         |                  |
                         +------------------+ ASCE limit
      
      Unused gaps in the virtual memory layout could be present
      or not - depending on how a particular system is configured.
      No page tables are created for the unused gaps.
      
      The relative order of vmalloc, modules and kernel image in
      virtual memory is defined by the following considerations:

      - start of the modules area and end of the kernel should reside
        within 4GB to accommodate relative 32-bit jumps. The best way
        to achieve that is to place the kernel next to the modules;

      - vmalloc and module areas should be located next to each other
        to prevent failures and extra reworks in user level tools
        (makedumpfile, crash, etc.) which treat vmalloc and module
        addresses similarly;
      
      - the kernel needs to be the last area in the virtual memory
        layout to easily distinguish between kernel and non-kernel
        virtual addresses. That is needed to (again) simplify
        handling of addresses in user level tools and make the __pa()
        macro faster (see below);
      
      Concluding the above, the relative order of the considered
      virtual areas in memory is: vmalloc - modules - kernel.
      Therefore, the only change to the current memory layout is
      moving the kernel to the end of the virtual address space.
      
      With that approach the implementation of the __pa() macro is
      straightforward - all linear virtual addresses less than the
      kernel base are considered identity mapping addresses:
      
      	phys == virt - __identity_base
      
      All addresses greater than the kernel base are kernel ones:
      
      	phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys
      
      By contrast, the __va() macro deals only with identity mapping
      addresses:
      
      	virt == phys + __identity_base
      
      The .amode31 section is mapped separately and is not covered by
      the __pa() macro. In fact, it could have been handled easily by
      checking whether a virtual address is within the section or
      not, but there is no need for that. Thus, let the __pa() code
      use as few machine cycles as possible.
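
      Putting the above together, a hedged sketch of the two macros
      (variable names as used in this description; the real implementation
      may differ in detail):

        /* Linear addresses below the kernel base are identity-mapped;
         * everything at or above it belongs to the kernel image: */
        #define __pa(x) ({                                              \
                unsigned long __x = (unsigned long)(x);                 \
                __x < __kaslr_offset ?                                  \
                        __x - __identity_base :                         \
                        __x - __kaslr_offset + __kaslr_offset_phys;     \
        })

        /* __va() only ever deals with identity mapping addresses: */
        #define __va(x) ((void *)((unsigned long)(x) + __identity_base))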
      
      The KASAN shadow memory is located at the very end of the
      virtual memory layout, at addresses higher than the kernel.
      However, that is not a linear mapping and no code other than
      KASAN instrumentation or API is expected to access it.
      
      When KASLR mode is enabled the kernel base address is randomized
      within a memory window that spans the whole unused virtual address
      space. The size of that window depends on the amount of
      physical memory available to the system, the limit imposed by
      UltraVisor (if present) and the vmalloc area size as provided
      by the vmalloc= kernel command line parameter.
      
      In case the virtual memory is exhausted the minimum size of
      the randomization window is forcefully set to 2GB, which
      amounts to 15 bits of entropy if KASAN is enabled, or 17
      bits of entropy in the default configuration.
      
      The default kernel offset 0x100000 is used as a magic value
      both in the decompressor code and the vmlinux linker script,
      but it will be removed with a follow-up change.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/crash: Use old os_info to create PT_LOAD headers · f4cac27d
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      The vmcore ELF program headers describe virtual memory
      regions of a crashed kernel. User level tools use that
      information for kernel text and data analysis (e.g.
      vmcore-dmesg extracts the kernel log).
      
      Currently the kernel image is covered by program headers
      describing the identity mapping regions. But in the future
      the kernel image will be mapped into a separate region outside
      of the identity mapping. Create an additional ELF program
      header that covers the kernel image only, so that vmcore tools
      can locate kernel text and data.
      
      Further, the identity mapping in the crashed and capture kernels
      will have different base addresses. Due to that, the __va() macro
      cannot be used in the capture kernel. Instead, read the crashed
      kernel identity mapping base address from os_info and use it
      for the creation of PT_LOAD type program headers.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/vmcoreinfo: Store virtual memory layout · 378e32aa
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      The virtual memory layout is needed for address translation
      by the crash tool when the /proc/kcore device is used as the
      memory image.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/os_info: Store virtual memory layout · 8572f525
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      The virtual memory layout will be read out by makedumpfile,
      crash and other user tools for virtual address translation.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/os_info: Introduce value entries · 88702793
      Alexander Gordeev authored
      Introduce entries that do not reference any data in memory,
      but rather provide values. Set the size of such entries to
      zero and do not compute a checksum for them, since there is no
      data whose integrity needs to be checked. The integrity of
      the value entries themselves is still covered by the os_info
      checksum.
      
      Reserve the lowest unused entry index OS_INFO_RESERVED for
      future use - presumably for the number of entries present.
      That could later be used by user level tools. The existing
      tools would not notice any difference.
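
      A hedged sketch of what adding a value entry could look like; the
      function and field names below are assumptions, not the actual API:

        /* Store a bare value: the addr field carries the value itself,
         * size is zero and no checksum is computed for it. */
        static void os_info_add_value(int nr, u64 value)
        {
                os_info.entry[nr].addr = value; /* a value, not an address */
                os_info.entry[nr].size = 0;
                os_info.entry[nr].csum = 0;
        }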
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Make .amode31 section address range explicit · 5fb50fa6
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      Introduce the AMODE31_START and AMODE31_END macros for the
      .amode31 section address range, for later use.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Make identity mapping base address explicit · 7de0446f
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      Currently the identity mapping base address is implicit
      and is always set to zero. Make it explicit by putting it
      into the __identity_base persistent boot variable and using
      it in the proper context - which is the value of PAGE_OFFSET.
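
      A minimal sketch of what "explicit" means here, assuming the
      declaration style of other s390 boot variables:

        /* Persistent boot variable; zero for now, non-zero once the
         * identity mapping moves away from address zero: */
        unsigned long __bootdata_preserved(__identity_base);

        #define PAGE_OFFSET     __identity_base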
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Uncouple virtual and physical kernel offsets · 3bb11234
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      Currently __kaslr_offset is the kernel offset in both
      physical memory on boot and in virtual memory after DAT
      mode is enabled.
      
      Uncouple these offsets and rename the physical address
      space variant to __kaslr_offset_phys, while keeping the name
      __kaslr_offset for the offset in the virtual address space.
      
      Do not use __kaslr_offset_phys after DAT mode is enabled
      just yet, but still make it a persistent boot variable
      for later use.
      
      Use the __kaslr_offset and __kaslr_offset_phys offsets in
      the proper contexts and alter the handle_relocs() function to
      distinguish between the two.
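
      Sketched, the uncoupling boils down to two variables (the
      declaration style is assumed from other s390 boot variables):

        /* Kernel offset in physical memory on boot ... */
        unsigned long __bootdata_preserved(__kaslr_offset_phys);
        /* ... and in virtual memory once DAT mode is enabled: */
        unsigned long __bootdata_preserved(__kaslr_offset);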
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/mm: Create virtual memory layout structure · 236f324b
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      Put virtual memory layout information into a structure
      to improve code generation when accessing the structure
      members, which are currently only ident_map_size and
      __kaslr_offset.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/mm: Move KASLR related to <asm/page.h> · bbe72f39
      Alexander Gordeev authored
      Move everything KASLR related to <asm/page.h>,
      similarly to many other architectures.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Suggested-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Swap vmalloc and Lowcore/Real Memory Copy areas · c8aef260
      Alexander Gordeev authored
      This is a preparatory rework to allow uncoupling virtual
      and physical address spaces.
      
      Currently the order of virtual memory areas is as follows (the
      lowcore and .amode31 section are skipped, as they are irrelevant):
      
      	identity mapping (the kernel is contained within)
      	vmemmap
      	vmalloc
      	modules
      	Absolute Lowcore
      	Real Memory Copy
      
      In the future the kernel will be mapped separately and placed
      at the end of the virtual address space, so the layout would
      look like this:
      
      	identity mapping
      	vmemmap
      	vmalloc
      	modules
      	Absolute Lowcore
      	Real Memory Copy
      	kernel
      
      However, the distance between kernel and modules needs to be as
      small as possible, ideally none. Thus, the Absolute Lowcore
      and Real Memory Copy areas would stay in the way and therefore
      need to be moved as well:
      
      	identity mapping
      	vmemmap
      	Absolute Lowcore
      	Real Memory Copy
      	vmalloc
      	modules
      	kernel
      
      To facilitate such a layout, swap the vmalloc area with the
      Absolute Lowcore and Real Memory Copy areas. As a result, the
      current layout turns into:
      
      	identity mapping (the kernel is contained within)
      	vmemmap
      	Absolute Lowcore
      	Real Memory Copy
      	vmalloc
      	modules
      
      This will allow locating the kernel directly next to the
      modules once it gets mapped separately.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Reduce size of identity mapping on overlap · ecf74da6
      Alexander Gordeev authored
      In case the vmemmap array would overlap with the vmalloc area
      during virtual memory layout setup, the size of the vmalloc area
      is currently decreased. That could result in less memory than the
      user requested with the vmalloc= kernel command line parameter.
      Instead, reduce the size of the identity mapping (and, as a
      result, the size of the vmemmap array) to avoid such an overlap.
      
      Further, currently the virtual memory allocation "rolls"
      from top to bottom and it is only VMALLOC_START that could
      get increased due to the overlap. Change that to decrease-
      only, which makes the whole allocation algorithm easier
      to comprehend.
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Consider DCSS segments on memory layout setup · b2b15f07
      Alexander Gordeev authored
      The maximum mappable physical address (as returned by the
      arch_get_mappable_range() callback) is limited by the
      value of (1UL << MAX_PHYSMEM_BITS).

      The maximum physical address available to a DCSS segment
      is 512GB.

      In case the available online or offline memory size is less
      than the DCSS limit, arch_get_mappable_range() would include
      the never-used [512GB..(1UL << MAX_PHYSMEM_BITS)] range.
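
      A hedged sketch of the resulting clamping; MAX_DCSS_ADDR is an
      assumed name for the 512GB DCSS limit, not necessarily the one used
      by the patch:

        #define MAX_DCSS_ADDR   (512UL * SZ_1G)

        /* Do not report address space beyond what either real memory
         * or DCSS segments could ever occupy: */
        max_mappable = max(ident_map_size, MAX_DCSS_ADDR);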
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • s390/boot: Do not force vmemmap to start at MAX_PHYSMEM_BITS · 47bf8176
      Alexander Gordeev authored
      vmemmap is forcefully set to start at MAX_PHYSMEM_BITS at most.
      That might have been needed in the past to limit ident_map_size
      to MAX_PHYSMEM_BITS. However, since commit 75eba6ec0de1 ("s390:
      unify identity mapping limits handling") ident_map_size is
      limited in the setup_ident_map_size() function, which is called
      earlier.

      Another reason to limit the vmemmap start to MAX_PHYSMEM_BITS
      was that it was returned by arch_get_mappable_range() as the
      maximum mappable physical address. Since commit f641679dfe55
      ("s390/mm: rework arch_get_mappable_range() callback") that
      is not required anymore.

      As a result, there is no necessity to limit the vmemmap
      starting address to MAX_PHYSMEM_BITS.
      Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
    • KVM: s390: vsie: Use virt_to_phys for facility control block · 22fdd8ba
      Nina Schoetterl-Glausch authored
      In order for SIE to interpretively execute STFLE, it requires the real
      or absolute address of a facility-list control block.
      Before writing the location into the shadow SIE control block, convert
      it from a virtual address.
      We currently do not run into this bug because the lower 31 bits are the
      same for virtual and physical addresses.
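
      The shape of the fix, sketched with the field names the text implies
      (flag and format details omitted; not the verbatim patch):

        /* Give SIE the real/absolute location of the shadow
         * facility-list control block instead of a truncated
         * virtual address: */
        scb_s->fac = (u32)virt_to_phys(&vsie_page->fac);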
      Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com
      Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
      Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com>
      Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
  8. 12 Apr, 2024 3 commits