    Merge branch 'shared-zeropage' into features · 22a49f6d
    Alexander Gordeev authored
    David Hildenbrand says:
    
    ===================
    This series fixes one issue with uffd + shared zeropages on s390x and
    lets "ordinary" KVM guests make use of shared zeropages again.
    
    userfaultfd could currently end up mapping shared zeropages into processes
    that forbid shared zeropages. This only applies to s390x, relevant for
    handling PV guests and guests that use storage keys correctly. Fix it
    by placing a zeroed folio instead of the shared zeropage during
    UFFDIO_ZEROPAGE.
    
    I stumbled over this issue while looking into a customer scenario that
    is using:
    
    (1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
        and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
        available and additional memory can be "fake hotplugged" to the VM
        later on demand by deflating the balloon. Actual memory overcommit is
        not desired, so physical memory would only be moved between VMs.
    
    (2) Live migration of VMs between sites to evacuate servers in case of
        emergency.
    
    Without the shared zeropage, during (2), the VM would suddenly consume
    100 GiB on the migration source and destination. On the migration source,
    where we don't expect memory overcommit, we could easily end up crashing
    the VM during migration.
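
    The arithmetic behind this scenario can be sketched as follows. This is
    a toy estimate only: the function name and the assumption that every
    non-ballooned page is already populated are illustrative and not part
    of the series.

```python
GIB = 1024 ** 3


def resident_bytes(vm_size, balloon_size, shared_zeropage):
    """Toy estimate of host memory backing the guest after all of it was read.

    Assumption (hypothetical model): ballooned memory was never written, so
    reading it maps the shared zeropage when that is allowed, and a freshly
    allocated zeroed page otherwise.
    """
    in_use = vm_size - balloon_size
    if shared_zeropage:
        # Ballooned pages are backed by the single shared zeropage.
        return in_use
    # Every page that gets read ends up with its own physical page.
    return vm_size


vm, balloon = 100 * GIB, 60 * GIB
print(resident_bytes(vm, balloon, shared_zeropage=True) // GIB)
print(resident_bytes(vm, balloon, shared_zeropage=False) // GIB)
```

    With the numbers from (1), the estimate is 40 GiB with the shared
    zeropage and the full 100 GiB without it.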
    
    Independent of that, memory handed back to the hypervisor using "free page
    reporting" would end up consuming actual memory after the migration on the
    destination, not getting freed up until reused+freed again.
    
    While there might be ways to optimize parts of this in QEMU, we really
    should just support the shared zeropage again for ordinary VMs.
    
    We only expect legacy guests to make use of storage keys, so let's handle
    zeropages again when enabling storage keys or when enabling PV. To not
    break userfaultfd like we did in the past, don't zap the shared zeropages,
    but instead trigger unsharing faults, just like we do for unsharing
    KSM pages in break_ksm().
    
    Unsharing faults will simply replace the shared zeropage by a zeroed
    anonymous folio. We can already trigger the same fault path using GUP,
    when trying to long-term pin a shared zeropage, but also when unmerging
    KSM-placed zeropages, so this is nothing new.
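
    The effect of an unsharing fault can be modelled in a few lines. This is
    a userspace toy, not kernel code: the dict standing in for a page table,
    the 4096-byte page size and the name unshare_zeropages() are all
    assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model

# One immutable, all-zero page that any number of mappings may point at.
SHARED_ZEROPAGE = bytes(PAGE_SIZE)


def unshare_zeropages(page_table):
    """Replace every shared-zeropage entry with a private zeroed page.

    `page_table` maps a virtual address to the object backing it.
    Returns the number of entries that were unshared.
    """
    replaced = 0
    for addr, page in page_table.items():
        if page is SHARED_ZEROPAGE:
            # Same contents (all zeroes), but now a private, writable page.
            page_table[addr] = bytearray(PAGE_SIZE)
            replaced += 1
    return replaced


table = {
    0x0000: SHARED_ZEROPAGE,
    0x1000: bytearray(b"\x01" * PAGE_SIZE),  # already-private data page
    0x2000: SHARED_ZEROPAGE,
}
print(unshare_zeropages(table))
```

    The mapping stays in place and still reads as zeroes, which is why
    userfaultfd does not observe a missing page the way it would after
    zapping.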
    
    Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
    running the uffd selftests.
    
    Patch #2 tested on s390x: the live migration scenario now works as
    expected, and kvm-unit-tests that trigger usage of skeys work well, whereby
    I can see detection and unsharing of shared zeropages.
    
    Further (as broken in v2), I tested that the shared zeropage is no
    longer populated after skeys are used -- that mm_forbids_zeropage() works
    as expected:
      ./s390x-run s390x/skey.elf \
       -no-shutdown \
       -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
       -mon chardev=monitor,mode=readline
    
      Then, in another shell:
    
      # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
      Rss:               31484 kB
      #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
      ...
      # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
      Rss:              160452 kB
    
      -> Reading guest memory does not populate the shared zeropage
    
      Doing the same with selftest.elf (no skeys)
    
      # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
      Rss:               30900 kB
      #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
      ...
      # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
      Rss:               30924 kB
    
      -> Reading guest memory does populate the shared zeropage
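
    The Rss comparison above can also be scripted. The following is a
    minimal sketch that parses the "Rss:" line and computes the growth for
    both runs; the helper name rss_kb() is hypothetical, and the sample
    values are the ones quoted above.

```python
import re


def rss_kb(smaps_rollup_text):
    """Return the Rss value in kB from smaps_rollup-style text."""
    match = re.search(r"^\s*Rss:\s+(\d+) kB", smaps_rollup_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Rss line found")
    return int(match.group(1))


# (before, after) Rss lines as quoted in the test runs above.
runs = {
    "skey.elf": ("Rss:               31484 kB", "Rss:              160452 kB"),
    "selftest.elf": ("Rss:               30900 kB", "Rss:               30924 kB"),
}

for name, (before, after) in runs.items():
    print(f"{name}: Rss grew by {rss_kb(after) - rss_kb(before)} kB")
```

    For skey.elf the growth is 128968 kB (about 126 MiB), whereas for
    selftest.elf it is only 24 kB.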
    ===================
    Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>