1. 20 Apr, 2020 1 commit
    • KVM: PPC: Book3S HV: Handle non-present PTEs in page fault functions · ae49deda
      Paul Mackerras authored
      Since cd758a9b "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT
      page fault handler", it's been possible in fairly rare circumstances to
      load a non-present PTE in kvmppc_book3s_hv_page_fault() when running a
      guest on a POWER8 host.
      
      Because that case wasn't checked for, we could misinterpret the non-present
      PTE as being a cache-inhibited PTE.  That could mismatch with the
      corresponding hash PTE, which would cause the function to fail with -EFAULT
      a little further down.  That would propagate up to the KVM_RUN ioctl(),
      generally causing the KVM userspace (usually qemu) to fall over.
      
      This addresses the problem by catching that case and returning to the guest
      instead.
      
      For completeness, this fixes the radix page fault handler in the same
      way.  For radix this didn't cause any obvious misbehaviour, because we
      ended up putting the non-present PTE into the guest's partition-scoped
      page tables, leading immediately to another hypervisor data/instruction
      storage interrupt, which would go through the page fault path again
      and fix things up.
      
      Fixes: cd758a9b "KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot in HPT page fault handler"
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1820402
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Tested-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ae49deda
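
      The fix described in the commit above comes down to testing the present
      bit before reading any other PTE attribute. A minimal sketch of that
      ordering follows. The flag values, the enum, and classify_pte() are
      hypothetical names chosen for illustration; they are not the kernel's
      definitions or the actual patch.

```c
#include <stdint.h>

/* Hypothetical flag layout for illustration only; not the powerpc PTE format. */
#define DEMO_PTE_PRESENT   (1u << 0)
#define DEMO_PTE_CACHEABLE (1u << 1)

enum fault_action {
    FAULT_RETRY_GUEST,          /* return to the guest and let it fault again */
    FAULT_MAP_CACHEABLE,
    FAULT_MAP_CACHE_INHIBITED,
};

/*
 * Decide how to handle a PTE read during a page fault.  The present bit is
 * checked first: a non-present PTE carries no meaningful attribute bits, so
 * reading it as "not cacheable" would wrongly classify it as cache-inhibited.
 */
enum fault_action classify_pte(uint32_t pte)
{
    if (!(pte & DEMO_PTE_PRESENT))
        return FAULT_RETRY_GUEST;
    if (pte & DEMO_PTE_CACHEABLE)
        return FAULT_MAP_CACHEABLE;
    return FAULT_MAP_CACHE_INHIBITED;
}
```

      With this ordering, an all-zero PTE is never reported as cache-inhibited;
      the caller simply resumes the guest.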
  2. 07 Apr, 2020 8 commits
  3. 03 Apr, 2020 5 commits
  4. 02 Apr, 2020 26 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8c1b724d
      Linus Torvalds authored
      Pull kvm updates from Paolo Bonzini:
       "ARM:
         - GICv4.1 support
      
         - 32bit host removal
      
        PPC:
         - secure (encrypted) guests under the Protected Execution Framework
           ultravisor
      
        s390:
         - allow disabling GISA (hardware interrupt injection) and protected
           VMs/ultravisor support.
      
        x86:
         - New dirty bitmap flag that sets all bits in the bitmap when dirty
           page logging is enabled; this is faster because it doesn't require
           bulk modification of the page tables.
      
         - Initial work on making nested SVM event injection more similar to
           VMX, and less buggy.
      
         - Various cleanups to MMU code (though the big ones and related
           optimizations were delayed to 5.8). Instead of using cr3 in
           function names which occasionally means eptp, KVM too has
           standardized on "pgd".
      
         - A large refactoring of CPUID features, which now use an array that
           parallels the core x86_features.
      
         - Some removal of pointer chasing from kvm_x86_ops, which will also
           be switched to static calls as soon as they are available.
      
         - New Tigerlake CPUID features.
      
         - More bugfixes, optimizations and cleanups.
      
        Generic:
         - selftests: cleanups, new MMU notifier stress test, steal-time test
      
         - CSV output for kvm_stat"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits)
        x86/kvm: fix a missing-prototypes "vmread_error"
        KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y
        KVM: VMX: Add a trampoline to fix VMREAD error handling
        KVM: SVM: Annotate svm_x86_ops as __initdata
        KVM: VMX: Annotate vmx_x86_ops as __initdata
        KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup()
        KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection
        KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes
        KVM: VMX: Configure runtime hooks using vmx_x86_ops
        KVM: VMX: Move hardware_setup() definition below vmx_x86_ops
        KVM: x86: Move init-only kvm_x86_ops to separate struct
        KVM: Pass kvm_init()'s opaque param to additional arch funcs
        s390/gmap: return proper error code on ksm unsharing
        KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move()
        KVM: Fix out of range accesses to memslots
        KVM: X86: Micro-optimize IPI fastpath delay
        KVM: X86: Delay read msr data iff writes ICR MSR
        KVM: PPC: Book3S HV: Add a capability for enabling secure guests
        KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs
        KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs
        ...
      8c1b724d
    • Merge tag 'x86-urgent-2020-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f14a9532
      Linus Torvalds authored
      Pull x86 fix from Ingo Molnar:
       "A single fix addressing Sparse warnings. <asm/bitops.h> is changed
        non-trivially to avoid the warnings, but generated code is not
        supposed to be affected"
      
      * tag 'x86-urgent-2020-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Fix bitops.h warning with a moved cast
      f14a9532
    • Merge branch 'next-integrity' of... · 7f218319
      Linus Torvalds authored
      Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
      
      Pull integrity updates from Mimi Zohar:
       "Just a couple of updates for linux-5.7:
      
         - A new Kconfig option to enable IMA architecture specific runtime
           policy rules needed for secure and/or trusted boot, as requested.
      
         - Some message cleanup (eg. pr_fmt, additional error messages)"
      
      * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
        ima: add a new CONFIG for loading arch-specific policies
        integrity: Remove duplicate pr_fmt definitions
        IMA: Add log statements for failure conditions
        IMA: Update KBUILD_MODNAME for IMA files to ima
      7f218319
    • Merge branch 'akpm' (patches from Andrew) · 6cad420c
      Linus Torvalds authored
      Merge updates from Andrew Morton:
       "A large amount of MM, plenty more to come.
      
        Subsystems affected by this patch series:
         - tools
         - kthread
         - kbuild
         - scripts
         - ocfs2
         - vfs
         - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
               sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
               hugetlbfs, hugetlb"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
        include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
        mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
        selftests/vm: fix map_hugetlb length used for testing read and write
        mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
        mm/hugetlb.c: clean code by removing unnecessary initialization
        hugetlb_cgroup: add hugetlb_cgroup reservation docs
        hugetlb_cgroup: add hugetlb_cgroup reservation tests
        hugetlb: support file_region coalescing again
        hugetlb_cgroup: support noreserve mappings
        hugetlb_cgroup: add accounting for shared mappings
        hugetlb: disable region_add file_region coalescing
        hugetlb_cgroup: add reservation accounting for private mappings
        mm/hugetlb_cgroup: fix hugetlb_cgroup migration
        hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
        hugetlb_cgroup: add hugetlb_cgroup reservation counter
        hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
        hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
        mm/memblock.c: remove redundant assignment to variable max_addr
        mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
        mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
        ...
      6cad420c
    • Merge tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 7be97138
      Linus Torvalds authored
      Pull xfs updates from Darrick Wong:
       "There's a lot going on this cycle with cleanups in the log code, the
        btree code, and the xattr code.
      
        We're tightening metadata validation and online fsck checking, and
        introducing a common btree rebuilding library so that we can refactor
        xfs_repair and introduce online repair in a future cycle.
      
        We also fixed a few visible bugs -- most notably there's one in
        getdents that we introduced in 5.6; and a fix for hangs when disabling
        quotas.
      
        This series has been running fstests & other QA in the background for
        over a week and looks good so far.
      
        I anticipate sending a second pull request next week. That batch will
        change how xfs interacts with memory reclaim; how the log batches and
        throttles log items; how hard writes near ENOSPC will try to squeeze
        more space out of the filesystem; and hopefully fix the last of the
        umount hangs after a catastrophic failure. That should ease a lot of
        problems when running at the limits, but for now I'm leaving that in
        for-next for another week to make sure we got all the subtleties
        right.
      
        Summary:
      
         - Fix a hard to trigger race between iclog error checking and log
           shutdown.
      
         - Strengthen the AGF verifier.
      
         - Ratelimit some of the more spammy error messages.
      
         - Remove the icdinode uid/gid members and just use the ones in the
           vfs inode.
      
         - Hold ILOCK across insert/collapse range.
      
         - Clean up the extended attribute interfaces.
      
         - Clean up the attr flags mess.
      
         - Restore PF_MEMALLOC after exiting xfsaild thread to avoid
           triggering warnings in the process accounting code.
      
         - Remove the flexibly-sized array from struct xfs_agfl to eliminate
           compiler warnings about unaligned pointers and packed structures.
      
         - Various macro and typedef removals.
      
         - Stale metadata buffers if we decide they're corrupt outside of a
           verifier.
      
         - Check directory data/block/free block owners.
      
         - Fix a UAF when aborting inactivation of a corrupt xattr fork.
      
         - Teach online scrub to report failed directory and attr name lookups
           as a metadata corruption instead of a runtime error.
      
         - Avoid potential buffer overflows in sysfs files by using scnprintf.
      
         - Fix a regression in getdents lookups due to a mistake in pointer
           arithmetic.
      
         - Refactor btree cursor private data structures to use anonymous
           unions.
      
         - Cleanups in the log unmounting code.
      
         - Fix a potential mishandling of ENOMEM errors on multi-block
           directory buffer lookups.
      
         - Fix an incorrect test in the block allocation code.
      
         - Cleanups and name prefix shortening in the scrub code.
      
         - Introduce btree bulk loading code for online repair and scrub.
      
         - Fix a quotaoff log item leak (and hang) when the fs goes down
           midway through a quotaoff operation.
      
         - Remove di_version from the incore inode.
      
         - Refactor some of the log shutdown checking code.
      
         - Record the forcing of the log unmount records in the log force
           counters.
      
         - Fix a longstanding bug where quotacheck would purge the
           administrator's default quota grace interval and warning limits.
      
         - Reduce memory usage when scrubbing directory and xattr trees.
      
         - Don't let fsfreeze race with GETFSMAP or online scrub.
      
         - Handle bio_add_page failures more gracefully in xlog_write_iclog"
      
      * tag 'xfs-5.7-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (108 commits)
        xfs: prohibit fs freezing when using empty transactions
        xfs: shutdown on failure to add page to log bio
        xfs: directory bestfree check should release buffers
        xfs: drop all altpath buffers at the end of the sibling check
        xfs: preserve default grace interval during quotacheck
        xfs: remove xlog_state_want_sync
        xfs: move the ioerror check out of xlog_state_clean_iclog
        xfs: refactor xlog_state_clean_iclog
        xfs: remove the aborted parameter to xlog_state_done_syncing
        xfs: simplify log shutdown checking in xfs_log_release_iclog
        xfs: simplify the xfs_log_release_iclog calling convention
        xfs: factor out a xlog_wait_on_iclog helper
        xfs: merge xlog_cil_push into xlog_cil_push_work
        xfs: remove the di_version field from struct icdinode
        xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
        xfs: simplify di_flags2 inheritance in xfs_ialloc
        xfs: only check the superblock version for dinode size calculation
        xfs: add a new xfs_sb_version_has_v3inode helper
        xfs: fix unmount hang and memory leak on shutdown during quotaoff
        xfs: factor out quotaoff intent AIL removal and memory free
        ...
      7be97138
    • Merge tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 7db83c07
      Linus Torvalds authored
      Pull hibernation fix from Darrick Wong:
       "Fix a regression where we broke the userspace hibernation driver by
        disallowing writes to the swap device"
      
      * tag 'vfs-5.7-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        hibernate: Allow uswsusp to write to swap
      7db83c07
    • Merge tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 35a9fafe
      Linus Torvalds authored
      Pull iomap updates from Darrick Wong:
       "We're fixing tracepoints and comments in this cycle, so there
        shouldn't be any surprises here.
      
        I anticipate sending a second pull request next week with a single bug
        fix for readahead, but it's still undergoing QA.
      
        Summary:
      
         - Fix a broken tracepoint
      
         - Fix a broken comment"
      
      * tag 'iomap-5.7-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: fix comments in iomap_dio_rw
        iomap: Remove pgoff from tracepoints
      35a9fafe
    • Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9c577491
      Linus Torvalds authored
      Pull vfs pathwalk sanitizing from Al Viro:
       "Massive pathwalk rewrite and cleanups.
      
        Several iterations have been posted; hopefully this thing is getting
        readable and understandable now. Pretty much all parts of pathname
        resolutions are affected...
      
        The branch is identical to what has sat in -next, except for commit
        message in "lift all calls of step_into() out of follow_dotdot/
        follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
        commit message changed there."
      
      * 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
        lookup_open(): don't bother with fallbacks to lookup+create
        atomic_open(): no need to pass struct open_flags anymore
        open_last_lookups(): move complete_walk() into do_open()
        open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
        open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
        open_last_lookups(): consolidate fsnotify_create() calls
        take post-lookup part of do_last() out of loop
        link_path_walk(): sample parent's i_uid and i_mode for the last component
        __nd_alloc_stack(): make it return bool
        reserve_stack(): switch to __nd_alloc_stack()
        pick_link(): take reserving space on stack into a new helper
        pick_link(): more straightforward handling of allocation failures
        fold path_to_nameidata() into its only remaining caller
        pick_link(): pass it struct path already with normal refcounting rules
        fs/namei.c: kill follow_mount()
        non-RCU analogue of the previous commit
        helper for mount rootwards traversal
        follow_dotdot(): be lazy about changing nd->path
        follow_dotdot_rcu(): be lazy about changing nd->path
        follow_dotdot{,_rcu}(): massage loops
        ...
      9c577491
    • x86/kvm: fix a missing-prototypes "vmread_error" · 514ccc19
      Qian Cai authored
      Commit 842f4be9 ("KVM: VMX: Add a trampoline to fix VMREAD error
      handling") removed the declaration of vmread_error(), which causes a W=1
      build failure with KVM_WERROR=y. Fix it by adding it back.
      
      arch/x86/kvm/vmx/vmx.c:359:17: error: no previous prototype for 'vmread_error' [-Werror=missing-prototypes]
       asmlinkage void vmread_error(unsigned long field, bool fault)
                       ^~~~~~~~~~~~
      Signed-off-by: Qian Cai <cai@lca.pw>
      Message-Id: <20200402153955.1695-1-cai@lca.pw>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      514ccc19
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace · d987ca1c
      Linus Torvalds authored
      Pull exec/proc updates from Eric Biederman:
       "This contains two significant pieces of work: the work to sort out
        proc_flush_task, and the work to solve a deadlock between strace and
        exec.
      
        Fixing proc_flush_task so that it no longer requires a persistent
        mount makes improvements to proc possible. The removal of the
        persistent mount solves an old regression that caused the hidepid
        mount option to only work on remount not on mount. The regression was
        found and reported by the Android folks. This further allows Alexey
        Gladkov's work making proc mount options specific to an individual
        mount of proc to move forward.
      
        The work on exec starts solving a long standing issue with exec that
        it takes mutexes of blocking userspace applications, which makes exec
        extremely deadlock prone. For the moment this adds a second mutex with
        a narrower scope that handles all of the easy cases, which makes the
        tricky cases easy to spot. With a little luck the code to solve those
        deadlocks will be ready by next merge window"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
        signal: Extend exec_id to 64bits
        pidfd: Use new infrastructure to fix deadlocks in execve
        perf: Use new infrastructure to fix deadlocks in execve
        proc: io_accounting: Use new infrastructure to fix deadlocks in execve
        proc: Use new infrastructure to fix deadlocks in execve
        kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
        kernel: doc: remove outdated comment cred.c
        mm: docs: Fix a comment in process_vm_rw_core
        selftests/ptrace: add test cases for dead-locks
        exec: Fix a deadlock in strace
        exec: Add exec_update_mutex to replace cred_guard_mutex
        exec: Move exec_mmap right after de_thread in flush_old_exec
        exec: Move cleanup of posix timers on exec out of de_thread
        exec: Factor unshare_sighand out of de_thread and call it separately
        exec: Only compute current once in flush_old_exec
        pid: Improve the comment about waiting in zap_pid_ns_processes
        proc: Remove the now unnecessary internal mount of proc
        uml: Create a private mount of proc for mconsole
        uml: Don't consult current to find the proc_mnt in mconsole_proc
        proc: Use a list of inodes to flush from proc
        ...
      d987ca1c
    • include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP · 77d6b909
      Matthew Wilcox (Oracle) authored
      It's even more important to check that we don't have a tail page when
      calling hpage_nr_pages() when THP are disabled.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77d6b909
    • mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS · bb297bb2
      Christophe Leroy authored
      When CONFIG_HUGETLB_PAGE is set but not CONFIG_HUGETLBFS, the following
      build failure is encountered:
      
        In file included from arch/powerpc/mm/fault.c:33:0:
        include/linux/hugetlb.h: In function 'hstate_inode':
        include/linux/hugetlb.h:477:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
          return HUGETLBFS_SB(i->i_sb)->hstate;
                 ^
        include/linux/hugetlb.h:477:30: error: invalid type argument of '->' (have 'int')
          return HUGETLBFS_SB(i->i_sb)->hstate;
                                      ^
      
      Gate hstate_inode() with CONFIG_HUGETLBFS instead of CONFIG_HUGETLB_PAGE.
      
      Fixes: a137e1cc ("hugetlbfs: per mount huge page sizes")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Link: http://lkml.kernel.org/r/7e8c3a3c9a587b9cd8a2f146df32a421b961f3a2.1584432148.git.christophe.leroy@c-s.fr
      Link: https://patchwork.ozlabs.org/patch/1255548/#2386036
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb297bb2
    • selftests/vm: fix map_hugetlb length used for testing read and write · cabc30da
      Christophe Leroy authored
      Commit fa7b9a80 ("tools/selftest/vm: allow choosing mem size and page
      size in map_hugetlb") added the possibility to change the size of memory
      mapped for the test, but left the read and write test using the default
      value.  This goes unnoticed when mapping a length greater than the default
      one, but segfaults otherwise.
      
      Fix read_bytes() and write_bytes() by giving them the real length.
      
      Also fix the call to munmap().
      
      Fixes: fa7b9a80 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/9a404a13c871c4bd0ba9ede68f69a1225180dd7e.1580978385.git.christophe.leroy@c-s.fr
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cabc30da
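
      The idea behind the map_hugetlb fix above is to pass the real mapping
      length to every helper and to munmap() instead of relying on a default.
      The sketch below shows that pattern under simplifying assumptions:
      ordinary anonymous pages stand in for MAP_HUGETLB so it runs without
      reserved huge pages, and check_mapping() is a hypothetical wrapper that
      does not exist in the selftest.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>

/* Fill the whole mapping; the length is passed in rather than assumed. */
void write_bytes(char *addr, size_t length)
{
    for (size_t i = 0; i < length; i++)
        addr[i] = (char)i;
}

/* Return 0 if every byte matches what write_bytes() stored, 1 otherwise. */
int read_bytes(const char *addr, size_t length)
{
    for (size_t i = 0; i < length; i++)
        if (addr[i] != (char)i)
            return 1;
    return 0;
}

/*
 * Map `length` bytes, exercise exactly that many, and unmap the same length.
 * Returns 0 on success, 1 on a data mismatch, -1 if mmap/munmap fails.
 */
int check_mapping(size_t length)
{
    char *addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ret;

    if (addr == MAP_FAILED)
        return -1;

    write_bytes(addr, length);
    ret = read_bytes(addr, length);

    if (munmap(addr, length) != 0)
        return -1;
    return ret;
}
```

      Because the same length flows through mmap(), both helpers, and munmap(),
      a mapping smaller than some built-in default can no longer be overrun.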
    • mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() · d4af73e3
      Vlastimil Babka authored
      Commit f1e61557 ("mm: pack compound_dtor and compound_order into one
      word in struct page") changed compound_dtor from a pointer to an array
      index in order to pack it.  To check whether a page has the hugetlbfs
      compound_dtor, we can just compare the index directly without fetching the
      function pointer.  Said commit did that with PageHuge() and we can do the
      same with PageHeadHuge() to make the code a bit smaller and faster.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Neha Agarwal <nehaagarwal@google.com>
      Link: http://lkml.kernel.org/r/20200311172440.6988-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4af73e3
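
      The optimization in the commit above can be illustrated with a small
      stand-alone model: once the destructor is stored as a table index,
      comparing the index avoids loading the function pointer from the table.
      All demo_* names below are hypothetical and only mimic the shape of the
      kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_page;
typedef void demo_dtor_fn(struct demo_page *);

/* Destructor identifiers; each one indexes the table below. */
enum demo_dtor_id {
    DEMO_NULL_DTOR,
    DEMO_COMPOUND_DTOR,
    DEMO_HUGETLB_DTOR,
    DEMO_NR_DTORS,
};

struct demo_page {
    unsigned char dtor_id;      /* small index instead of a function pointer */
};

void demo_free_compound(struct demo_page *page) { (void)page; }
void demo_free_huge(struct demo_page *page) { (void)page; }

demo_dtor_fn *const demo_dtors[DEMO_NR_DTORS] = {
    [DEMO_NULL_DTOR]     = NULL,
    [DEMO_COMPOUND_DTOR] = demo_free_compound,
    [DEMO_HUGETLB_DTOR]  = demo_free_huge,
};

/* Slower form: load the pointer from the table, then compare pointers. */
bool is_huge_via_pointer(const struct demo_page *page)
{
    assert(page->dtor_id < DEMO_NR_DTORS);
    return demo_dtors[page->dtor_id] == demo_free_huge;
}

/* Faster form: compare the stored index directly, with no table load. */
bool is_huge_via_index(const struct demo_page *page)
{
    return page->dtor_id == DEMO_HUGETLB_DTOR;
}
```

      Both predicates give the same answer for every valid index; the second
      one just skips a memory fetch.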
    • mm/hugetlb.c: clean code by removing unnecessary initialization · 353b2de4
      Mateusz Nosek authored
      Previously the variable 'check_addr' was initialized but never read
      before being reassigned, so the initialization can be removed.
      Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Link: http://lkml.kernel.org/r/20200303212354.25226-1-mateusznosek0@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      353b2de4
    • hugetlb_cgroup: add hugetlb_cgroup reservation docs · 6566704d
      Mina Almasry authored
      Add docs for how to use hugetlb_cgroup reservations, and their behavior.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-9-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6566704d
    • hugetlb_cgroup: add hugetlb_cgroup reservation tests · 29750f71
      Mina Almasry authored
      The tests use both shared and private mapped hugetlb memory, and monitor
      the hugetlb usage counter as well as the hugetlb reservation counter.
      They test different configurations such as hugetlb memory usage via
      hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
      MAP_POPULATE.
      
      Also add a test for hugetlb reservation reparenting, since this is a
      subtle issue.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sandipan Das <sandipan@linux.ibm.com> [powerpc64]
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-8-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29750f71
    • hugetlb: support file_region coalescing again · a9b3f867
      Mina Almasry authored
      An earlier patch in this series disabled file_region coalescing in order
      to hang the hugetlb_cgroup uncharge info on the file_region entries.
      
      This patch re-adds support for coalescing of file_region entries.
      Essentially, every time we add an entry, we call a recursive function that
      tries to coalesce the added region with the regions next to it.  The worst
      case call depth for this function is 3: one to coalesce with the next
      region, one to coalesce with the previous region, and one to reach the
      base case.
      
      This is an important performance optimization as private mappings add
      their entries page by page, and we could incur big performance costs for
      large mappings with lots of file_region entries in their resv_map.
      
      [almasrymina@google.com: fix CONFIG_CGROUP_HUGETLB ifdefs]
        Link: http://lkml.kernel.org/r/20200214204544.231482-1-almasrymina@google.com
      [almasrymina@google.com: remove check_coalesce_bug debug code]
        Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9b3f867
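
      The recursive coalescing described in the commit above can be sketched on
      a plain doubly linked list. This is an illustrative model, not the kernel
      code: struct demo_region, its integer owner tag (standing in for the
      cgroup uncharge info), link_region(), and coalesce_region() are all
      hypothetical, and ranges are treated as half-open [from, to).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_region {
    long from, to;                      /* half-open range [from, to) */
    int owner;                          /* stand-in for the uncharge info */
    struct demo_region *prev, *next;
};

/* Two neighbours can merge only if they touch and share the same owner. */
static bool can_merge(const struct demo_region *a, const struct demo_region *b)
{
    return a && b && a->to == b->from && a->owner == b->owner;
}

/* Allocate a region and link it right after `pos` (or alone if pos is NULL). */
struct demo_region *link_region(struct demo_region *pos, long from, long to,
                                int owner)
{
    struct demo_region *rg = malloc(sizeof(*rg));

    if (!rg)
        return NULL;
    rg->from = from;
    rg->to = to;
    rg->owner = owner;
    rg->prev = pos;
    rg->next = pos ? pos->next : NULL;
    if (rg->next)
        rg->next->prev = rg;
    if (pos)
        pos->next = rg;
    return rg;
}

/*
 * Merge `rg` into its previous neighbour if possible, otherwise absorb its
 * next neighbour, and recurse on the surviving node.  If the list was fully
 * coalesced before `rg` was added, the call depth is at most 3.
 */
void coalesce_region(struct demo_region *rg)
{
    struct demo_region *prev, *next;

    assert(rg != NULL);
    prev = rg->prev;
    next = rg->next;

    if (can_merge(prev, rg)) {
        prev->to = rg->to;
        prev->next = next;
        if (next)
            next->prev = prev;
        free(rg);
        coalesce_region(prev);
        return;
    }
    if (can_merge(rg, next)) {
        rg->to = next->to;
        rg->next = next->next;
        if (next->next)
            next->next->prev = rg;
        free(next);
        coalesce_region(rg);
        return;
    }
    /* Base case: no adjacent region with the same owner. */
}
```

      Adding [1, 2) between [0, 1) and [2, 3) with the same owner collapses the
      list to a single [0, 3) entry; regions with different owners are left
      separate, which keeps the per-region uncharge information intact.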
    • hugetlb_cgroup: support noreserve mappings · 08cf9faf
      Mina Almasry authored
      Support MAP_NORESERVE accounting as part of the new counter.
      
      For each hugepage allocation, at allocation time we check if there is a
      reservation for this allocation or not.  If there is a reservation for
      this allocation, then this allocation was charged at reservation time, and
      we don't re-account it.  If there is no reservation for this allocation,
      we charge the appropriate hugetlb_cgroup.
      
      The hugetlb_cgroup to uncharge for this allocation is stored in
      page[3].private.  We use new APIs added in an earlier patch to set this
      pointer.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08cf9faf
    • hugetlb_cgroup: add accounting for shared mappings · 075a61d0
      Mina Almasry authored
      For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
      in the resv_map entries, in file_region->reservation_counter.
      
      After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
      if successful, we pass on the hugetlb_cgroup info to a follow-up
      region_add call.  When a file_region entry is added to the resv_map via
      region_add, we put the pointer to that cgroup in
      file_region->reservation_counter.  If charging doesn't succeed, we report
      the error to the caller, so that the kernel fails the reservation.
      
      On region_del, which is when the hugetlb memory is unreserved, we also
      uncharge the file_region->reservation_counter.
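
The charge, record, and uncharge flow described above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions, not the kernel code: `CounterModel` and `ResvMapModel` are hypothetical names, ranges are assumed to be fresh and non-overlapping, and sizes are counted in pages.

```python
class CounterModel:
    """Hypothetical stand-in for a reservation counter with a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0

    def try_charge(self, pages):
        # Refuse the charge if it would push usage past the limit.
        if self.usage + pages > self.limit:
            return False
        self.usage += pages
        return True

    def uncharge(self, pages):
        self.usage -= pages


class ResvMapModel:
    """Toy resv_map: each entry remembers which counter to uncharge."""

    def __init__(self):
        self.entries = []  # list of (start, end, counter)

    def reserve(self, start, end, counter):
        # Models region_chg, then the charge, then region_add.
        if not counter.try_charge(end - start):
            return False  # charging failed, so the reservation fails
        self.entries.append((start, end, counter))
        return True

    def unreserve_all(self):
        # Models region_del over the whole map: every entry uncharges
        # the counter that was recorded when it was added.
        for start, end, counter in self.entries:
            counter.uncharge(end - start)
        self.entries.clear()
```

Because each entry carries its own counter, two entries in the same map can be uncharged against different cgroups, which is the property the shared-mapping accounting relies on.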
      
      [akpm@linux-foundation.org: forward declare struct file_region]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      075a61d0
    • Mina Almasry's avatar
      hugetlb: disable region_add file_region coalescing · 0db9d74e
      Mina Almasry authored
      A follow-up patch in this series adds hugetlb cgroup uncharge info to the
      file_region entries in resv->regions.  The cgroup uncharge info may differ
      for different regions, so they can no longer be coalesced at region_add
      time.  So, disable region coalescing in region_add in this patch.
      
      Behavior change:
      
      Say a resv_map exists like this [0->1], [2->3], and [5->6].
      
      Then a region_chg/add call comes in region_chg/add(f=0, t=5).
      
      Old code would generate resv->regions: [0->5], [5->6].
      New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
      [5->6].
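
The new behavior in the example above can be reproduced with a short userspace sketch. It is a model under stated assumptions, not the kernel implementation: `add_without_coalescing` is a hypothetical name, and regions are represented as sorted, non-overlapping half-open `(start, end)` tuples.

```python
def add_without_coalescing(regions, f, t):
    """Reserve [f, t) without merging entries.

    Existing entries are left untouched; only the uncovered gaps inside
    [f, t) receive new entries.  Returns (new_regions, entries_added).
    """
    gaps = []
    cursor = f  # first offset in [f, t) not yet known to be covered
    for start, end in regions:
        if end <= cursor:
            continue  # entry ends before the uncovered part begins
        if start >= t:
            break  # entry starts at or after the end of the request
        if start > cursor:
            gaps.append((cursor, start))  # hole before this entry
        cursor = max(cursor, end)
    if cursor < t:
        gaps.append((cursor, t))  # trailing hole up to t
    return sorted(regions + gaps), len(gaps)


regions, added = add_without_coalescing([(0, 1), (2, 3), (5, 6)], 0, 5)
print(regions, added)
```

In this model the second return value is the number of new entries a single call creates, which is the quantity that can now exceed 1.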
      
      Special care needs to be taken to handle the resv->adds_in_progress
      variable correctly.  In the past, only 1 region would be added for every
      region_chg and region_add call.  But now, each call may add multiple
      regions, so we can no longer increment adds_in_progress by 1 in
      region_chg, or decrement adds_in_progress by 1 after region_add or
      region_abort.  Instead, region_chg calls add_reservation_in_range() to
      count the number of regions needed and allocates those, and that info is
      passed to region_add and region_abort to decrement adds_in_progress
      correctly.
      
      We've also modified the assumption that region_add after region_chg never
      fails.  region_chg now pre-allocates at least 1 region for region_add.  If
      region_add needs more regions than region_chg has allocated for it, then
      it may fail.
      
      [almasrymina@google.com: fix file_region entry allocations]
        Link: http://lkml.kernel.org/r/20200219012736.20363-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0db9d74e
    • Mina Almasry's avatar
      hugetlb_cgroup: add reservation accounting for private mappings · e9fe92ae
      Mina Almasry authored
      Normally the pointer to the cgroup to uncharge hangs off the struct page,
      and gets queried when it's time to free the page.  With hugetlb_cgroup
      reservations, this is not possible, because a page can be reserved by one
      task and actually faulted in by another task.
      
      The best place to put the hugetlb_cgroup pointer to uncharge for
      reservations is in the resv_map.  But, because the resv_map has different
      semantics for private and shared mappings, the code path to
      charge/uncharge shared and private mappings is different.  This patch
      implements charging and uncharging for private mappings.
      
      For private mappings, the counter to uncharge is in
      resv_map->reservation_counter.  On initializing the resv_map this is set
      to NULL.  On reservation of a region in a private mapping, the task's
      hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
      resv_map->reservation_counter.
      
      On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
      
      [akpm@linux-foundation.org: forward declare struct resv_map]
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9fe92ae
    • Mina Almasry's avatar
      mm/hugetlb_cgroup: fix hugetlb_cgroup migration · 9808895e
      Mina Almasry authored
      Commit c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge
      hugetlb reservations") mistakenly doesn't handle the migration of *both*
      the reservation hugetlb_cgroup and the fault hugetlb_cgroup correctly.
      
      What should happen is that both cgroups should be queried from the old
      page, then both set to NULL on the old page, then both inserted into the
      new page.
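
The three steps described above can be written out as a minimal userspace sketch. It is an illustration only: `PageModel` and `migrate_cgroups` are hypothetical names, and plain attributes stand in for wherever the kernel stores the two cgroup pointers.

```python
class PageModel:
    """Toy page carrying a fault cgroup and a reservation cgroup."""

    def __init__(self, fault_cg=None, rsvd_cg=None):
        self.fault_cg = fault_cg
        self.rsvd_cg = rsvd_cg


def migrate_cgroups(old_page, new_page):
    # Step 1: query both cgroups from the old page.
    fault_cg, rsvd_cg = old_page.fault_cg, old_page.rsvd_cg
    # Step 2: clear both on the old page.
    old_page.fault_cg = None
    old_page.rsvd_cg = None
    # Step 3: install both on the new page.
    new_page.fault_cg = fault_cg
    new_page.rsvd_cg = rsvd_cg
```

Skipping any one of the six assignments leaves one of the two cgroups either stranded on the old page or missing from the new one.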
      
      The mistake also creates the following warning:
      
      mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_migrate':
      mm/hugetlb_cgroup.c:777:25: warning: variable 'h_cg' set but not used
      [-Wunused-but-set-variable]
        struct hugetlb_cgroup *h_cg;
                               ^~~~
      
      The solution is to add the missing steps, namely setting the reservation
      hugetlb_cgroup to NULL on the old page, and setting the fault
      hugetlb_cgroup on the new page.
      
      Fixes: c32300516047 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200218194727.46995-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9808895e
    • Mina Almasry's avatar
      hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations · 1adc4d41
      Mina Almasry authored
      Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage
      or hugetlb reservation counter.
      
      Adds a new interface to uncharge a hugetlb_cgroup counter via
      hugetlb_cgroup_uncharge_counter.
      
      Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
      hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1adc4d41
    • Mina Almasry's avatar
      hugetlb_cgroup: add hugetlb_cgroup reservation counter · cdc2fcfe
      Mina Almasry authored
      These counters will track hugetlb reservations rather than hugetlb memory
      faulted in.  This patch only adds the counter; following patches add the
      charging and uncharging of the counter.
      
      This is patch 1 of a 9 patch series.
      
      Problem:
      
      Currently tasks attempting to reserve more hugetlb memory than is
      available get a failure at mmap/shmget time.  This is thanks to Hugetlbfs
      Reservations [1].  However, if a task attempts to reserve more hugetlb
      memory than its hugetlb_cgroup limit allows, the kernel will allow the
      mmap/shmget call, but will SIGBUS the task when it attempts to fault in
      the excess memory.
      
      We have users hitting their hugetlb_cgroup limits and thus we've been
      looking at this failure mode.  We'd like to improve this behavior such
      that users violating the hugetlb_cgroup limits get an error at mmap/shmget
      time, rather than getting SIGBUS'd when they try to fault the excess
      memory in.  This gives the user an opportunity to fall back more gracefully
      to non-hugetlbfs memory, for example.
      
      The underlying problem is that today's hugetlb_cgroup accounting happens
      at hugetlb memory *fault* time, rather than at *reservation* time.  Thus,
      enforcing the hugetlb_cgroup limit only happens at fault time, and the
      offending task gets SIGBUS'd.
      
      Proposed Solution:
      
      A new page counter named
      'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
      slightly different semantics than
      'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
      
      - While usage_in_bytes tracks all *faulted* hugetlb memory,
        rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
        memory faulted in without a prior reservation.
      
      - If a task attempts to reserve more memory than limit_in_bytes allows,
        the kernel will allow it to do so.  But if a task attempts to reserve
        more memory than rsvd.limit_in_bytes, the kernel will fail this
        reservation.
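
The difference between the two limits can be illustrated with a toy userspace model. This is a sketch under simplifying assumptions rather than the kernel logic: `HugetlbCgroupModel` is a hypothetical name, sizes are plain byte counts, and faults without a prior reservation (which the list above says are also charged to rsvd.usage_in_bytes) are left out.

```python
class HugetlbCgroupModel:
    """Toy model of the fault counter and the reservation counter."""

    def __init__(self, limit, rsvd_limit):
        self.limit = limit            # models limit_in_bytes
        self.rsvd_limit = rsvd_limit  # models rsvd.limit_in_bytes
        self.usage = 0                # models usage_in_bytes
        self.rsvd_usage = 0           # models rsvd.usage_in_bytes

    def reserve(self, size):
        # Models mmap/shmget time: only the reservation limit is checked.
        if self.rsvd_usage + size > self.rsvd_limit:
            return False  # the reservation is refused up front
        self.rsvd_usage += size
        return True

    def fault(self, size):
        # Models fault time: only the fault limit is checked.
        if self.usage + size > self.limit:
            return False  # corresponds to the SIGBUS case
        self.usage += size
        return True


MB = 1 << 20
cg = HugetlbCgroupModel(limit=4 * MB, rsvd_limit=64 * MB)
print(cg.reserve(8 * MB))  # reservation is accepted despite limit=4 MB
print(cg.fault(4 * MB))    # fits within the fault limit
print(cg.fault(2 * MB))    # exceeds the fault limit, failure arrives late
```

With only the fault limit set low, the failure shows up at fault time; setting `rsvd_limit` low instead makes `reserve` return `False` immediately, which is the early error the series is after.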
      
      This proposal is implemented in this patch series, with tests to verify
      functionality and show the usage.
      
      Alternatives considered:
      
      1. A new cgroup, instead of only a new page_counter attached to the
         existing hugetlb_cgroup.  Adding a new cgroup seemed like a lot of code
         duplication with hugetlb_cgroup.  Keeping hugetlb related page counters
         under hugetlb_cgroup seemed cleaner as well.
      
      2. Instead of adding a new counter, we considered adding a sysctl that
         modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
         accounting at reservation time rather than fault time.  Adding a new
         page_counter seems better as userspace could, if it wants, choose to
         enforce different cgroups differently: one via limit_in_bytes, and
         another via rsvd.limit_in_bytes.  This could be very useful if you're
         transitioning how hugetlb memory is partitioned on your system one
         cgroup at a time, for example.  Also, someone may find usage for both
         limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
         gives them the option to do so.
      
      Testing:
      - Added tests passing.
      - Used libhugetlbfs for regression testing.
      
      [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdc2fcfe
    • Mike Kravetz's avatar
      hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race · 87bf91d3
      Mike Kravetz authored
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straightforward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken
      into account as well.
      
      Instead of writing the required complicated code for this rare
      occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
      mode for the duration of page fault processing.  Hold i_mmap_rwsem in
      write mode when modifying i_size.  In this way, truncation cannot
      proceed when page faults are being processed.  In addition, i_size
      will not change during fault processing so a single check can be made
      to ensure faults are not beyond (proposed) end of file.  Faults can
      still race with hole punch, but that race is handled by existing code
      and the use of hugetlb_fault_mutex.
      
      With this modification, checks for races with truncation in the page
      fault path can be simplified and removed.  remove_inode_hugepages no
      longer needs to take hugetlb_fault_mutex in the case of truncation.
      Comments are expanded to explain reasoning behind locking.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-3-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bf91d3