1. 20 Jul, 2019 4 commits
    • KVM: Boost vCPUs that are delivering interrupts · d73eb57b
      Wanpeng Li authored
      Inspired by commit 9cac38dd (KVM/s390: Set preempted flag during
      vcpu wakeup and interrupt delivery), we want to also boost not just
      lock holders but also vCPUs that are delivering interrupts. Most
      smp_call_function_many calls are synchronous, so the IPI target vCPUs
      are also good yield candidates.  This patch introduces vcpu->ready to
      boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
      not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
      into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
      (VT-d PI handles voluntary preemption separately, in pi_pre_block).
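
      The flag handling described above can be sketched as a toy model. This is
      a simplified illustration, not the kernel code: struct toy_vcpu and the
      toy_* helpers are hypothetical names, and the real kvm_vcpu_on_spin applies
      further heuristics that are omitted here. The sketch assumes that "ready"
      is set both on involuntary preemption and on wakeup, while "preempted"
      is set only on involuntary preemption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a vCPU. The two flag names follow the commit text;
 * everything else is hypothetical. */
struct toy_vcpu {
    bool preempted; /* scheduled out involuntarily while still runnable */
    bool ready;     /* waiting for a pCPU: preempted, or just woken up */
};

/* Wakeup or interrupt delivery: the vCPU has work to do, mark it ready.
 * "preempted" is deliberately left alone. */
static void toy_vcpu_wake_up(struct toy_vcpu *v)
{
    v->ready = true;
}

/* Scheduled out: only an involuntary switch (task still runnable)
 * sets the flags. */
static void toy_vcpu_sched_out(struct toy_vcpu *v, bool still_runnable)
{
    if (still_runnable) {
        v->preempted = true;
        v->ready = true;
    }
}

/* Scheduled in: the vCPU is running again, clear both flags. */
static void toy_vcpu_sched_in(struct toy_vcpu *v)
{
    v->preempted = false;
    v->ready = false;
}

/* A spinning vCPU looks for a yield candidate: the first other vCPU
 * whose "ready" flag is set. Returns its index, or -1 if none. */
static int toy_pick_yield_target(const struct toy_vcpu *vcpus, size_t n,
                                 size_t self)
{
    for (size_t i = 0; i < n; i++) {
        if (i == self)
            continue;
        if (vcpus[i].ready)
            return (int)i;
    }
    return -1;
}
```

      In this model a vCPU that blocked voluntarily and was then woken to
      receive an IPI becomes a yield candidate even though "preempted" stays
      false, which is the case the separate flag is meant to cover.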
      
      Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
      ebizzy -M
      
                  vanilla     boosting    improved
      1VM          21443       23520         9%
      2VM           2800        8000       180%
      3VM           1800        3100        72%
      
      Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
      one running ebizzy -M, the other running 'stress --cpu 2':
      
      w/ boosting + w/o pv sched yield(vanilla)
      
                  vanilla     boosting   improved
                    1570         4000      155%
      
      w/ boosting + w/ pv sched yield(vanilla)
      
                  vanilla     boosting   improved
                    1844         5157      179%
      
      w/o boosting, perf top in VM:
      
       72.33%  [kernel]       [k] smp_call_function_many
        4.22%  [kernel]       [k] call_function_i
        3.71%  [kernel]       [k] async_page_fault
      
      w/ boosting, perf top in VM:
      
       38.43%  [kernel]       [k] smp_call_function_many
        6.31%  [kernel]       [k] async_page_fault
        6.13%  libc-2.23.so   [.] __memcpy_avx_unaligned
        4.88%  [kernel]       [k] call_function_interrupt
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d73eb57b
    • KVM: selftests: Remove superfluous define from vmx.c · 2417c870
      Thomas Huth authored
      The code in vmx.c does not use "program_invocation_name", so there
      is no need to "#define _GNU_SOURCE" here.
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2417c870
    • KVM: SVM: Fix detection of AMD Errata 1096 · 118154bd
      Liran Alon authored
      When the CPU raises #NPF on a guest data access and guest CR4.SMAP=1, it is
      possible that the CPU microcode implementing DecodeAssist will fail
      to read the bytes of the instruction which caused #NPF. This is AMD errata
      1096 and it happens because CPU microcode reading instruction bytes
      incorrectly attempts to read code as implicit supervisor-mode data
      accesses (that is, just like it would read e.g. a TSS), which are
      susceptible to SMAP faults. The microcode reads CS:RIP and if it is
      a user-mode address according to the page tables, the processor
      gives up and returns no instruction bytes.  In this case,
      GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
      return 0 instead of the correct guest instruction bytes.
      
      Current KVM code attempts to detect and work around this errata, but it
      has multiple issues:
      
      1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
      which is required for encountering a SMAP fault.
      
      2) It assumes SMAP faults can only occur when guest CPL==3.
      However, in case guest CR4.SMEP=0, the guest can execute an instruction
      which resides in a user-accessible page with CPL<3 privilege. If this
      instruction raises a #NPF on its data access, then CPU DecodeAssist
      microcode will still encounter a SMAP violation.  Even though no sane
      OS will do so (as it's an obvious privilege escalation vulnerability),
      we still need to handle this semantically correctly on the KVM side.
      
      Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easily
      triggerable condition and guests usually enable SMAP together with SMEP.
      If the vCPU has CR4.SMEP=1, the errata could indeed be encountered only
      at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
      instead of #NPF.  We keep this condition to avoid false positives in
      the detection of the errata.
      
      In addition, to avoid future confusion and improve code readability,
      include details of the errata in code and not just in commit message.
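
      The detection condition described above can be sketched as a small
      standalone predicate. This is an illustration under stated assumptions,
      not the code in the patch: erratum_1096_possible and the TOY_* macro
      names are hypothetical, and the caller is assumed to consult the
      predicate only when the VMCB reported zero instruction bytes. The CR4
      bit positions (SMEP = bit 20, SMAP = bit 21) are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions; the macro names are hypothetical. */
#define TOY_CR4_SMEP (1ULL << 20)
#define TOY_CR4_SMAP (1ULL << 21)

/* Returns true when a zero-length instruction buffer on #NPF may be
 * explained by the erratum, following the commit text:
 *  - SMAP must be enabled, otherwise no SMAP fault can occur;
 *  - with SMEP enabled, only CPL==3 can be executing from a user page,
 *    so the erratum is only possible at CPL==3;
 *  - with SMEP disabled, any CPL may execute from a user page. */
static bool erratum_1096_possible(uint64_t cr4, unsigned int cpl)
{
    bool smep = (cr4 & TOY_CR4_SMEP) != 0;
    bool smap = (cr4 & TOY_CR4_SMAP) != 0;
    bool is_user = (cpl == 3);

    return smap && (!smep || is_user);
}
```

      For example, SMAP=1 with SMEP=0 matches at any CPL, while SMAP=1 with
      SMEP=1 matches only at CPL 3, and SMAP=0 never matches.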
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: Singh Brijesh <brijesh.singh@amd.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      118154bd
    • KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Wanpeng Li authored
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike ARM, Intel has no hardware virtual timer for guests,
      so both programming the timer in the guest and the firing of the emulated
      timer incur vmexits.  This patch tries to avoid the vmexit when the emulated
      timer fires, at least in the dedicated instance scenario when nohz_full is
      enabled.
      
      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available in server processors
      for several years. The guest timer interrupt can then be injected via
      posted interrupts, which are delivered by the housekeeping cpu once the
      emulated timer fires.
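
      The precondition described above can be sketched as a simple predicate.
      This is a hedged illustration, not the code in the patch: struct
      toy_timer_ctx and toy_can_post_timer_interrupt are hypothetical names,
      and the two inputs are assumed to be known to the caller.

```c
#include <stdbool.h>

/* Hypothetical inputs describing the host/vCPU configuration. */
struct toy_timer_ctx {
    bool apicv_active;       /* posted interrupts usable for this vCPU */
    bool timer_housekeeping; /* nohz_full set up, housekeeping cpus exist */
};

/* The emulated timer is offloaded and its interrupt posted only when
 * both conditions from the commit text hold; otherwise the timer fires
 * on the vCPU's own pCPU as before. */
static bool toy_can_post_timer_interrupt(const struct toy_timer_ctx *c)
{
    return c->apicv_active && c->timer_housekeeping;
}
```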
      
      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs for busy housekeeping.
      If mwait/hlt/pause vmexits are disabled to keep the vCPUs in non-root mode,
      a ~3% redis performance benefit can be observed on a Skylake server, and the
      number of external interrupt vmexits drops substantially.  Without patch:
      
                 VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%  39.30%    0.47us  106.09us   0.71us ( +-   1.09% )
      
      While with patch:
      
                 VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT     6871     9.29%   2.96%    0.44us   57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  2. 17 Jul, 2019 2 commits
  3. 15 Jul, 2019 8 commits
  4. 13 Jul, 2019 4 commits
    • Merge tag 'dlm-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm · 964a4eac
      Linus Torvalds authored
      Pull dlm updates from David Teigland:
       "This set removes some unnecessary debugfs error handling, and checks
        that lowcomms workqueues are not NULL before destroying"
      
      * tag 'dlm-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
        dlm: no need to check return value of debugfs_create functions
        dlm: check if workqueues are NULL before flushing/destroying
      964a4eac
    • Merge tag '9p-for-5.3' of git://github.com/martinetd/linux · 23bbbf5c
      Linus Torvalds authored
      Pull 9p updates from Dominique Martinet:
       "Two small fixes to properly cleanup the 9p transports list if
        virtio/xen module initialization fails.

        9p might otherwise try to access memory from a module that failed to
        register and got freed"
      
      * tag '9p-for-5.3' of git://github.com/martinetd/linux:
        9p/xen: Add cleanup path in p9_trans_xen_init
        9p/virtio: Add cleanup path in p9_virtio_init
      23bbbf5c
    • Merge tag 'f2fs-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs · a641a88e
      Linus Torvalds authored
      Pull f2fs updates from Jaegeuk Kim:
       "In this round, we've introduced native swap file support which can
        exploit DIO, enhanced existing checkpoint=disable feature with
        additional mount option to tune the triggering condition, and allowed
        user to preallocate physical blocks in a pinned file which will be
        useful to avoid f2fs fragmentation in append-only workloads. In
        addition, we've fixed subtle quota corruption issue.
      
        Enhancements:
         - add swap file support which uses DIO
         - allocate blocks for pinned file
         - allow SSR and mount option to enhance checkpoint=disable
         - enhance IPU IOs
         - add more sanity checks such as memory boundary access
      
        Bug fixes:
         - quota corruption in very corner case of error-injected SPO case
         - fix root_reserved on remount and some wrong counts
         - add missing fsck flag
      
        Some patches were also introduced to clean up ambiguous i_flags and
        debugging messages codes"
      
      * tag 'f2fs-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (33 commits)
        f2fs: improve print log in f2fs_sanity_check_ckpt()
        f2fs: avoid out-of-range memory access
        f2fs: fix to avoid long latency during umount
        f2fs: allow all the users to pin a file
        f2fs: support swap file w/ DIO
        f2fs: allocate blocks for pinned file
        f2fs: fix is_idle() check for discard type
        f2fs: add a rw_sem to cover quota flag changes
        f2fs: set SBI_NEED_FSCK for xattr corruption case
        f2fs: use generic EFSBADCRC/EFSCORRUPTED
        f2fs: Use DIV_ROUND_UP() instead of open-coding
        f2fs: print kernel message if filesystem is inconsistent
        f2fs: introduce f2fs_<level> macros to wrap f2fs_printk()
        f2fs: avoid get_valid_blocks() for cleanup
        f2fs: ioctl for removing a range from F2FS
        f2fs: only set project inherit bit for directory
        f2fs: separate f2fs i_flags from fs_flags and ext4 i_flags
        f2fs: replace ktype default_attrs with default_groups
        f2fs: Add option to limit required GC for checkpoint=disable
        f2fs: Fix accounting for unusable blocks
        ...
      a641a88e
    • Merge tag 'xfs-5.3-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 4ce9d181
      Linus Torvalds authored
      Pull xfs updates from Darrick Wong:
       "In this release there are a significant amounts of consolidations and
        cleanups in the log code; restructuring of the log to issue struct
        bios directly; new bulkstat ioctls to return v5 fs inode information
        (and fix all the padding problems of the old ioctl); the beginnings of
        multithreaded inode walks (e.g. quotacheck); and a reduction in memory
        usage in the online scrub code leading to reduced runtimes.
      
         - Refactor inode geometry calculation into a single structure instead
           of open-coding pieces everywhere.
      
         - Add online repair to build options.
      
         - Remove unnecessary function call flags and functions.
      
         - Claim maintainership of various loose xfs documentation and header
           files.
      
         - Use struct bio directly for log buffer IOs instead of struct
           xfs_buf.
      
         - Reduce log item boilerplate code requirements.
      
         - Merge log item code spread across too many files.
      
         - Further distinguish between log item commits and cancellations.
      
         - Various small cleanups to the ag small allocator.
      
         - Support cgroup-aware writeback
      
         - libxfs refactoring for mkfs cleanup
      
         - Remove unneeded #includes
      
         - Fix a memory allocation miscalculation in the new log bio code
      
         - Fix bisection problems
      
         - Fix a crash in ioend processing caused by tripping over freeing of
           preallocated transactions
      
         - Split out a generic inode walk mechanism from the bulkstat code,
           hook up all the internal users to use the walking code, then clean
           up bulkstat to serve only the bulkstat ioctls.
      
         - Add a multithreaded iwalk implementation to speed up quotacheck on
           fast storage with many CPUs.
      
         - Remove unnecessary return values in logging teardown functions.
      
         - Supplement the bstat and inogrp structures with new bulkstat and
           inumbers structures that have all the fields we need for v5
           filesystem features and none of the padding problems of their
           predecessors.
      
         - Wire up new ioctls that use the new structures with a much simpler
           bulk_ireq structure at the head instead of the pointer-happy mess we
           had before.
      
         - Enable userspace to constrain bulkstat returns to a single AG or a
           single special inode so that we can phase out a lot of geometry
           guesswork in userspace.
      
         - Reduce memory consumption and zeroing overhead in extended
           attribute scrub code.
      
         - Fix some behavioral regressions in the new bulkstat backend code.
      
         - Fix some behavioral regressions in the new log bio code"
      
      * tag 'xfs-5.3-merge-12' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (100 commits)
        xfs: chain bios the right way around in xfs_rw_bdev
        xfs: bump INUMBERS cursor correctly in xfs_inumbers_walk
        xfs: don't update lastino for FSBULKSTAT_SINGLE
        xfs: online scrub needn't bother zeroing its temporary buffer
        xfs: only allocate memory for scrubbing attributes when we need it
        xfs: refactor attr scrub memory allocation function
        xfs: refactor extended attribute buffer pointer functions
        xfs: attribute scrub should use seen_enough to pass error values
        xfs: allow single bulkstat of special inodes
        xfs: specify AG in bulk req
        xfs: wire up the v5 inumbers ioctl
        xfs: wire up new v5 bulkstat ioctls
        xfs: introduce v5 inode group structure
        xfs: introduce new v5 bulkstat structure
        xfs: rename bulkstat functions
        xfs: remove various bulk request typedef usage
        fs: xfs: xfs_log: Change return type from int to void
        xfs: poll waiting for quotacheck
        xfs: multithreaded iwalk implementation
        xfs: refactor INUMBERS to use iwalk functions
        ...
      4ce9d181
  5. 12 Jul, 2019 22 commits
    • Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 5010fe9f
      Linus Torvalds authored
      Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
       "Here's a patch series that sets up common parameter checking functions
        for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.
      
        The goal here is to reduce the amount of behavioral variance between
        the filesystems where those ioctls originated (ext2 and XFS,
        respectively) and everybody else.
      
         - Standardize parameter checking for the SETFLAGS and FSSETXATTR
           ioctls (which were the file attribute setters for ext4 and xfs and
           have now been hoisted to the vfs)
      
         - Only allow the DAX flag to be set on files and directories"
      
      * tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        vfs: only allow FSSETXATTR to set DAX flag on files and dirs
        vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
        vfs: teach vfs_ioc_fssetxattr_check to check project id info
        vfs: create a generic checking function for FS_IOC_FSSETXATTR
        vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
      5010fe9f
    • Merge tag 'linux-kselftest-5.3-rc1' of... · 8487d822
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest updates from Shuah Khan:
       "This Kselftest update for Linux 5.3-rc1 consists of build failure
        fixes and a minor code cleanup patch to remove duplicate headers"
      
      * tag 'linux-kselftest-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        rseq/selftests: Fix Thumb mode build failure on arm32
        kselftests: cgroup: remove duplicated include from test_freezer.c
        selftests: timestamping: Fix SIOCGSTAMP undeclared build failure
        selftests: dma-buf: Adding kernel config fragment CONFIG_UDMABUF=y
      8487d822
    • Merge tag 'kconfig-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 106f1466
      Linus Torvalds authored
      Pull Kconfig updates from Masahiro Yamada:
      
       - always require argument for --defconfig and remove the hard-coded
         arch/$(ARCH)/defconfig path
      
       - make arch/$(SRCARCH)/configs/defconfig the new default of defconfig
      
       - some code cleanups
      
      * tag 'kconfig-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kconfig: remove meaningless if-conditional in conf_read()
        kconfig: Fix spelling of sym_is_changable
        unicore32: rename unicore32_defconfig to defconfig
        kconfig: make arch/*/configs/defconfig the default of KBUILD_DEFCONFIG
        kconfig: add static qualifier to expand_string()
        kconfig: require the argument of --defconfig
        kconfig: remove always false ifeq ($(KBUILD_DEFCONFIG,) conditional
      106f1466
    • Merge tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 39ceda5c
      Linus Torvalds authored
      Pull Kbuild updates from Masahiro Yamada:
      
       - remove headers_{install,check}_all targets
      
       - remove unreasonable 'depends on !UML' from CONFIG_SAMPLES
      
       - re-implement 'make headers_install' more cleanly
      
       - add new header-test-y syntax to compile-test headers
      
       - compile-test exported headers to ensure they are compilable in
         user-space
      
       - compile-test headers under include/ to ensure they are self-contained
      
       - remove -Waggregate-return, -Wno-uninitialized, -Wno-unused-value
         flags
      
       - add -Werror=unknown-warning-option for Clang
      
       - add 128-bit built-in types support to genksyms
      
       - fix missed rebuild of modules.builtin
      
       - propagate 'No space left on device' error in fixdep to Make
      
       - allow Clang to use its integrated assembler
      
       - improve some coccinelle scripts
      
       - add a new flag KBUILD_ABS_SRCTREE to request Kbuild to use absolute
         path for $(srctree).
      
       - do not ignore errors when compression utility is missing
      
       - misc cleanups
      
      * tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (49 commits)
        kbuild: use -- separater intead of $(filter-out ...) for cc-cross-prefix
        kbuild: Inform user to pass ARCH= for make mrproper
        kbuild: fix compression errors getting ignored
        kbuild: add a flag to force absolute path for srctree
        kbuild: replace KBUILD_SRCTREE with boolean building_out_of_srctree
        kbuild: remove src and obj from the top Makefile
        scripts/tags.sh: remove unused environment variables from comments
        scripts/tags.sh: drop SUBARCH support for ARM
        kbuild: compile-test kernel headers to ensure they are self-contained
        kheaders: include only headers into kheaders_data.tar.xz
        kheaders: remove meaningless -R option of 'ls'
        kbuild: support header-test-pattern-y
        kbuild: do not create wrappers for header-test-y
        kbuild: compile-test exported headers to ensure they are self-contained
        init/Kconfig: add CONFIG_CC_CAN_LINK
        kallsyms: exclude kasan local symbols on s390
        kbuild: add more hints about SUBDIRS replacement
        coccinelle: api/stream_open: treat all wait_.*() calls as blocking
        coccinelle: put_device: Add a cast to an expression for an assignment
        coccinelle: put_device: Adjust a message construction
        ...
      39ceda5c
    • Merge tag 'asm-generic-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic · 5f26f114
      Linus Torvalds authored
      Pull asm-generic updates from Arnd Bergmann:
       "The asm-generic changes for 5.3 consist of a cleanup series to remove
        ptrace.h from Christoph Hellwig, who explains:
      
          'asm-generic/ptrace.h is a little weird in that it doesn't actually
           implement any functionality, but it provided multiple layers of
           macros that just implement trivial inline functions. We implement
           those directly in the few architectures and be off with a much
           simpler design.'
      
        at https://lore.kernel.org/lkml/20190624054728.30966-1-hch@lst.de/"
      
      * tag 'asm-generic-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
        asm-generic: remove ptrace.h
        x86: don't use asm-generic/ptrace.h
        sh: don't use asm-generic/ptrace.h
        powerpc: don't use asm-generic/ptrace.h
        arm64: don't use asm-generic/ptrace.h
      5f26f114
    • Merge tag 's390-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · aabfea8d
      Linus Torvalds authored
      Pull more s390 updates from Vasily Gorbik:
      
       - Fix integer overflow during stack frame unwind with invalid
         backchain.
      
       - Cleanup unused symbol export in zcrypt code.
      
       - Fix MIO addressing control activation in PCI code and expose its
         usage via sysfs.
      
       - Fix kernel image signature verification report presence detection.
      
       - Fix irq registration in vfio-ap code.
      
       - Add CPU measurement counters for newer machines.
      
       - Add base DASD thin provisioning support and code cleanups.
      
      * tag 's390-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (21 commits)
        s390/unwind: avoid int overflow in outside_of_stack
        s390/zcrypt: remove the exporting of ap_query_configuration
        s390/pci: add mio_enabled attribute
        s390: fix setting of mio addressing control
        s390/ipl: Fix detection of has_secure attribute
        s390: vfio-ap: fix irq registration
        s390/cpumf: Add extended counter set definitions for model 8561 and 8562
        s390/dasd: Handle out-of-space constraint
        s390/dasd: Add discard support for ESE volumes
        s390/dasd: Use ALIGN_DOWN macro
        s390/dasd: Make dasd_setup_queue() a discipline function
        s390/dasd: Add new ioctl to release space
        s390/dasd: Add dasd_sleep_on_queue_interruptible()
        s390/dasd: Add missing intensity definition
        s390/dasd: Fix whitespace
        s390/dasd: Add dynamic formatting support for ESE volumes
        s390/dasd: Recognise data for ESE volumes
        s390/dasd: Put sub-order definitions in a separate section
        s390/dasd: Make layout analysis ESE compatible
        s390/dasd: Remove old defines and function
        ...
      aabfea8d
    • Merge tag 'nios2-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · 7181feb9
      Linus Torvalds authored
      Pull arch/nios2 updates from Ley Foon Tan.
      
      * tag 'nios2-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: configs: Remove useless UEVENT_HELPER_PATH
        nios2: remove pointless second entry for CONFIG_TRACE_IRQFLAGS_SUPPORT
      7181feb9
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 39d7530d
      Linus Torvalds authored
      Pull KVM updates from Paolo Bonzini:
       "ARM:
         - support for chained PMU counters in guests
         - improved SError handling
         - handle Neoverse N1 erratum #1349291
         - allow side-channel mitigation status to be migrated
         - standardise most AArch64 system register accesses to msr_s/mrs_s
         - fix host MPIDR corruption on 32bit
         - selftests cleanups
      
        x86:
         - PMU event {white,black}listing
         - ability for the guest to disable host-side interrupt polling
         - fixes for enlightened VMCS (Hyper-V pv nested virtualization)
         - new hypercall to yield to IPI target
         - support for passing cstate MSRs through to the guest
         - lots of cleanups and optimizations
      
        Generic:
         - Some txt->rST conversions for the documentation"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (128 commits)
        Documentation: virtual: Add toctree hooks
        Documentation: kvm: Convert cpuid.txt to .rst
        Documentation: virtual: Convert paravirt_ops.txt to .rst
        KVM: x86: Unconditionally enable irqs in guest context
        KVM: x86: PMU Event Filter
        kvm: x86: Fix -Wmissing-prototypes warnings
        KVM: Properly check if "page" is valid in kvm_vcpu_unmap
        KVM: arm/arm64: Initialise host's MPIDRs by reading the actual register
        KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane
        kvm: LAPIC: write down valid APIC registers
        KVM: arm64: Migrate _elx sysreg accessors to msr_s/mrs_s
        KVM: doc: Add API documentation on the KVM_REG_ARM_WORKAROUNDS register
        KVM: arm/arm64: Add save/restore support for firmware workaround state
        arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests
        KVM: arm/arm64: Support chained PMU counters
        KVM: arm/arm64: Remove pmc->bitmask
        KVM: arm/arm64: Re-create event when setting counter value
        KVM: arm/arm64: Extract duplicated code to own function
        KVM: arm/arm64: Rename kvm_pmu_{enable/disable}_counter functions
        KVM: LAPIC: ARBPRI is a reserved register for x2APIC
        ...
      39d7530d
    • Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux · 16c97650
      Linus Torvalds authored
      Pull hyper-v updates from Sasha Levin:
      
       - Add a module description to the Hyper-V vmbus module.
      
       - Rework some vmbus code to separate architecture specifics out to
         arch/x86/. This is part of the work of adding arm64 support to
         Hyper-V.
      
      * tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
        Drivers: hv: vmbus: Break out ISA independent parts of mshyperv.h
        drivers: hv: Add a module description line to the hv_vmbus driver
      16c97650
    • Merge tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping · 9e3a25dc
      Linus Torvalds authored
      Pull dma-mapping updates from Christoph Hellwig:
      
       - move the USB special case that bounced DMA through a device bar into
         the USB code instead of handling it in the common DMA code (Laurentiu
         Tudor and Fredrik Noring)
      
       - don't dip into the global CMA pool for single page allocations
         (Nicolin Chen)
      
       - fix a crash when allocating memory for the atomic pool failed during
         boot (Florian Fainelli)
      
       - move support for MIPS-style uncached segments to the common code and
         use that for MIPS and nios2 (me)
      
       - make support for DMA_ATTR_NON_CONSISTENT and
         DMA_ATTR_NO_KERNEL_MAPPING generic (me)
      
       - convert nds32 to the generic remapping allocator (me)
      
      * tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping: (29 commits)
        dma-mapping: mark dma_alloc_need_uncached as __always_inline
        MIPS: only select ARCH_HAS_UNCACHED_SEGMENT for non-coherent platforms
        usb: host: Fix excessive alignment restriction for local memory allocations
        lib/genalloc.c: Add algorithm, align and zeroed family of DMA allocators
        nios2: use the generic uncached segment support in dma-direct
        nds32: use the generic remapping allocator for coherent DMA allocations
        arc: use the generic remapping allocator for coherent DMA allocations
        dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code
        dma-direct: handle DMA_ATTR_NON_CONSISTENT in common code
        dma-mapping: add a dma_alloc_need_uncached helper
        openrisc: remove the partial DMA_ATTR_NON_CONSISTENT support
        arc: remove the partial DMA_ATTR_NON_CONSISTENT support
        arm-nommu: remove the partial DMA_ATTR_NON_CONSISTENT support
        ARM: dma-mapping: allow larger DMA mask than supported
        dma-mapping: truncate dma masks to what dma_addr_t can hold
        iommu/dma: Apply dma_{alloc,free}_contiguous functions
        dma-remap: Avoid de-referencing NULL atomic_pool
        MIPS: use the generic uncached segment support in dma-direct
        dma-direct: provide generic support for uncached kernel segments
        au1100fb: fix DMA API abuse
        ...
      9e3a25dc
    • coresight: Make the coresight_device_fwnode_match declaration's fwnode parameter const · 9787aed5
      Nathan Chancellor authored
      Fix Linus' merge error in the parent commit, causing:
      
        drivers/hwtracing/coresight/coresight.c:1051:11: error: incompatible pointer types passing 'int (struct device *, void *)' to parameter of type 'int (*)(struct device *, const void *)' [-Werror,-Wincompatible-pointer-types]
                                              coresight_device_fwnode_match);
                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/device.h:173:17: note: passing argument to parameter 'match' here
                                       int (*match)(struct device *dev, const void *data));
                                             ^
      
      due to missed header file fixup.
      
      Fixes: f632a817 ("Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core")
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      [ Greg even sent this patch with his pull request, but I stupidly
        thought it was the merge resolution fix I had already done as part of
        the merge. But no, this was the extra fix for the header file
        that goes with the definition I _had_ caught   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
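The build error quoted in this commit comes from a callback whose data parameter lacks const being passed where 'int (*)(struct device *, const void *)' is expected. The following is a minimal user-space sketch of that callback shape, not the kernel code: struct device is a mock, and find_device() and fwnode_match() are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-in for the kernel's struct device; only what the sketch needs. */
struct device {
    const void *fwnode;   /* hypothetical handle the matcher compares against */
};

/* Same callback shape as the 'match' parameter quoted in the error message. */
typedef int (*match_fn)(struct device *dev, const void *data);

/* Hypothetical helper: return the first device accepted by match, else NULL. */
static struct device *find_device(struct device *devs, size_t n,
                                  const void *data, match_fn match)
{
    assert(devs != NULL && match != NULL);
    for (size_t i = 0; i < n; i++)
        if (match(&devs[i], data))
            return &devs[i];
    return NULL;
}

/*
 * The matcher takes 'const void *', so its type is exactly match_fn.
 * Declaring the parameter as plain 'void *' (in the definition or in a
 * header prototype) is what produces the incompatible-pointer-types error.
 */
static int fwnode_match(struct device *dev, const void *fwnode)
{
    return dev->fwnode == fwnode;
}
```

A caller would pass the matcher directly, for example find_device(devs, n, node, fwnode_match). The declaration in the header and the definition in the .c file both have to carry the const, which is the part this commit fixes up.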
    • Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · f632a817
      Linus Torvalds authored
      Pull driver core and debugfs updates from Greg KH:
       "Here is the "big" driver core and debugfs changes for 5.3-rc1
      
        It's a lot of different patches, all across the tree due to some api
        changes and lots of debugfs cleanups.
      
        Other than the debugfs cleanups, in this set of changes we have:
      
         - bus iteration function cleanups
      
         - scripts/get_abi.pl tool to display and parse Documentation/ABI
           entries in a simple way
      
         - cleanups to Documentation/ABI/ entries to make them parse easier
           due to typos and other minor things
      
         - default_attrs use for some ktype users
      
         - driver model documentation file conversions to .rst
      
         - compressed firmware file loading
      
         - deferred probe fixes
      
        All of these have been in linux-next for a while, with a bunch of
        merge issues that Stephen has been patient with me for"
      
      * tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
        debugfs: make error message a bit more verbose
        orangefs: fix build warning from debugfs cleanup patch
        ubifs: fix build warning after debugfs cleanup patch
        driver: core: Allow subsystems to continue deferring probe
        drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
        arch_topology: Remove error messages on out-of-memory conditions
        lib: notifier-error-inject: no need to check return value of debugfs_create functions
        swiotlb: no need to check return value of debugfs_create functions
        ceph: no need to check return value of debugfs_create functions
        sunrpc: no need to check return value of debugfs_create functions
        ubifs: no need to check return value of debugfs_create functions
        orangefs: no need to check return value of debugfs_create functions
        nfsd: no need to check return value of debugfs_create functions
        lib: 842: no need to check return value of debugfs_create functions
        debugfs: provide pr_fmt() macro
        debugfs: log errors when something goes wrong
        drivers: s390/cio: Fix compilation warning about const qualifiers
        drivers: Add generic helper to match by of_node
        driver_find_device: Unify the match function with class_find_device()
        bus_find_device: Unify the match callback with class_find_device
        ...
    • Merge branch 'akpm' (patches from Andrew) · ef8f3d48
      Linus Torvalds authored
      Merge updates from Andrew Morton:
       "Am experimenting with splitting MM up into identifiable subsystems
        perhaps with a view to gitifying it in complex ways. Also with more
        verbose "incoming" emails.
      
        Most of MM is here and a few other trees.
      
        Subsystems affected by this patch series:
         - hotfixes
         - iommu
         - scripts
         - arch/sh
         - ocfs2
         - mm:slab-generic
         - mm:slub
         - mm:kmemleak
         - mm:kasan
         - mm:cleanups
         - mm:debug
         - mm:pagecache
         - mm:swap
         - mm:memcg
         - mm:gup
         - mm:pagemap
         - mm:infrastructure
         - mm:vmalloc
         - mm:initialization
         - mm:pagealloc
         - mm:vmscan
         - mm:tools
         - mm:proc
         - mm:ras
         - mm:oom-kill
      
        hotfixes:
            mm: vmscan: scan anonymous pages on file refaults
            mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
            mm/memcontrol: fix wrong statistics in memory.stat
            mm/z3fold.c: lock z3fold page before  __SetPageMovable()
            nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
            MAINTAINERS: nilfs2: update email address
      
        iommu:
            include/linux/dmar.h: replace single-char identifiers in macros
      
        scripts:
            scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
            scripts/decode_stacktrace: look for modules with .ko.debug extension
            scripts/spelling.txt: drop "sepc" from the misspelling list
            scripts/spelling.txt: add spelling fix for prohibited
            scripts/decode_stacktrace: Accept dash/underscore in modules
            scripts/spelling.txt: add more spellings to spelling.txt
      
        arch/sh:
            arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
            sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
            sh: prevent warnings when using iounmap
      
        ocfs2:
            fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
            ocfs2/dlm: use struct_size() helper
            ocfs2: add last unlock times in locking_state
            ocfs2: add locking filter debugfs file
            ocfs2: add first lock wait time in locking_state
            ocfs: no need to check return value of debugfs_create functions
            fs/ocfs2/dlmglue.c: unneeded variable: "status"
            ocfs2: use kmemdup rather than duplicating its implementation
      
        mm:slab-generic:
          Patch series "mm/slab: Improved sanity checking":
            mm/slab: validate cache membership under freelist hardening
            mm/slab: sanity-check page type when looking up cache
            lkdtm/heap: add tests for freelist hardening
      
        mm:slub:
            mm/slub.c: avoid double string traverse in kmem_cache_flags()
            slub: don't panic for memcg kmem cache creation failure
      
        mm:kmemleak:
            mm/kmemleak.c: fix check for softirq context
            mm/kmemleak.c: change error at _write when kmemleak is disabled
            docs: kmemleak: add more documentation details
      
        mm:kasan:
            mm/kasan: print frame description for stack bugs
            Patch series "Bitops instrumentation for KASAN", v5:
              lib/test_kasan: add bitops tests
              x86: use static_cpu_has in uaccess region to avoid instrumentation
              asm-generic, x86: add bitops instrumentation for KASAN
            Patch series "mm/kasan: Add object validation in ksize()", v3:
              mm/kasan: introduce __kasan_check_{read,write}
              mm/kasan: change kasan_check_{read,write} to return boolean
              lib/test_kasan: Add test for double-kzfree detection
              mm/slab: refactor common ksize KASAN logic into slab_common.c
              mm/kasan: add object validation in ksize()
      
        mm:cleanups:
            include/linux/pfn_t.h: remove pfn_t_to_virt()
            Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
              arm: remove ARCH_SELECT_MEMORY_MODEL
              s390: remove ARCH_SELECT_MEMORY_MODEL
              sparc: remove ARCH_SELECT_MEMORY_MODEL
            mm/gup.c: make follow_page_mask() static
            mm/memory.c: trivial clean up in insert_page()
            mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
            include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
            mm: remove the account_page_dirtied export
            mm/page_isolation.c: change the prototype of undo_isolate_page_range()
            include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
            mm: remove the exporting of totalram_pages
            include/linux/pagemap.h: document trylock_page() return value
      
        mm:debug:
            mm/failslab.c: by default, do not fail allocations with direct reclaim only
            Patch series "debug_pagealloc improvements":
              mm, debug_pagelloc: use static keys to enable debugging
              mm, page_alloc: more extensive free page checking with debug_pagealloc
              mm, debug_pagealloc: use a page type instead of page_ext flag
      
        mm:pagecache:
            Patch series "fix filler_t callback type mismatches", v2:
              mm/filemap.c: fix an overly long line in read_cache_page
              mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
              jffs2: pass the correct prototype to read_cache_page
              9p: pass the correct prototype to read_cache_page
            mm/filemap.c: correct the comment about VM_FAULT_RETRY
      
        mm:swap:
            mm, swap: fix race between swapoff and some swap operations
            mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
            mm, swap: use rbtree for swap_extent
            mm/mincore.c: fix race between swapoff and mincore
      
        mm:memcg:
            memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
            memcg, fsnotify: no oom-kill for remote memcg charging
            mm, memcg: introduce memory.events.local
            mm: memcontrol: dump memory.stat during cgroup OOM
            Patch series "mm: reparent slab memory on cgroup removal", v7:
              mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
              mm: memcg/slab: rename slab delayed deactivation functions and fields
              mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
              mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
              mm: memcg/slab: unify SLAB and SLUB page accounting
              mm: memcg/slab: don't check the dying flag on kmem_cache creation
              mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
              mm: memcg/slab: rework non-root kmem_cache lifecycle management
              mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
              mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
            mm, memcg: add a memcg_slabinfo debugfs file
      
        mm:gup:
            Patch series "switch the remaining architectures to use generic GUP", v4:
              mm: use untagged_addr() for get_user_pages_fast addresses
              mm: simplify gup_fast_permitted
              mm: lift the x86_32 PAE version of gup_get_pte to common code
              MIPS: use the generic get_user_pages_fast code
              sh: add the missing pud_page definition
              sh: use the generic get_user_pages_fast code
              sparc64: add the missing pgd_page definition
              sparc64: define untagged_addr()
              sparc64: use the generic get_user_pages_fast code
              mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
              mm: reorder code blocks in gup.c
              mm: consolidate the get_user_pages* implementations
              mm: validate get_user_pages_fast flags
              mm: move the powerpc hugepd code to mm/gup.c
              mm: switch gup_hugepte to use try_get_compound_head
              mm: mark the page referenced in gup_hugepte
            mm/gup: speed up check_and_migrate_cma_pages() on huge page
            mm/gup.c: remove some BUG_ONs from get_gate_page()
            mm/gup.c: mark undo_dev_pagemap as __maybe_unused
      
        mm:pagemap:
            asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
            alpha: switch to generic version of pte allocation
            arm: switch to generic version of pte allocation
            arm64: switch to generic version of pte allocation
            csky: switch to generic version of pte allocation
            m68k: sun3: switch to generic version of pte allocation
            mips: switch to generic version of pte allocation
            nds32: switch to generic version of pte allocation
            nios2: switch to generic version of pte allocation
            parisc: switch to generic version of pte allocation
            riscv: switch to generic version of pte allocation
            um: switch to generic version of pte allocation
            unicore32: switch to generic version of pte allocation
            mm/pgtable: drop pgtable_t variable from pte_fn_t functions
            mm/memory.c: fail when offset == num in first check of __vm_map_pages()
      
        mm:infrastructure:
            mm/mmu_notifier: use hlist_add_head_rcu()
      
        mm:vmalloc:
            Patch series "Some cleanups for the KVA/vmalloc", v5:
              mm/vmalloc.c: remove "node" argument
              mm/vmalloc.c: preload a CPU with one object for split purpose
              mm/vmalloc.c: get rid of one single unlink_va() when merge
              mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
            mm/vmalloc.c: spelling> s/informaion/information/
      
        mm:initialization:
            mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
            mm/large system hash: clear hashdist when only one node with memory is booted
      
        mm:pagealloc:
            arm64: move jump_label_init() before parse_early_param()
            Patch series "add init_on_alloc/init_on_free boot options", v10:
              mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
              mm: init: report memory auto-initialization features at boot time
      
        mm:vmscan:
            mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
            mm: vmscan: correct some vmscan counters for THP swapout
      
        mm:tools:
            tools/vm/slabinfo: order command line options
            tools/vm/slabinfo: add partial slab listing to -X
            tools/vm/slabinfo: add option to sort by partial slabs
            tools/vm/slabinfo: add sorting info to help menu
      
        mm:proc:
            proc: use down_read_killable mmap_sem for /proc/pid/maps
            proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
            proc: use down_read_killable mmap_sem for /proc/pid/pagemap
            proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
            proc: use down_read_killable mmap_sem for /proc/pid/map_files
            mm: use down_read_killable for locking mmap_sem in access_remote_vm
            mm: smaps: split PSS into components
            mm: vmalloc: show number of vmalloc pages in /proc/meminfo
      
        mm:ras:
            mm/memory-failure.c: clarify error message
      
        mm:oom-kill:
            mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
            mm, oom: refactor dump_tasks for memcg OOMs
            mm, oom: remove redundant task_in_mem_cgroup() check
            oom: decouple mems_allowed from oom_unkillable_task
            mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"
      
      * akpm: (147 commits)
        mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
        oom: decouple mems_allowed from oom_unkillable_task
        mm, oom: remove redundant task_in_mem_cgroup() check
        mm, oom: refactor dump_tasks for memcg OOMs
        mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
        mm/memory-failure.c: clarify error message
        mm: vmalloc: show number of vmalloc pages in /proc/meminfo
        mm: smaps: split PSS into components
        mm: use down_read_killable for locking mmap_sem in access_remote_vm
        proc: use down_read_killable mmap_sem for /proc/pid/map_files
        proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
        proc: use down_read_killable mmap_sem for /proc/pid/pagemap
        proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
        proc: use down_read_killable mmap_sem for /proc/pid/maps
        tools/vm/slabinfo: add sorting info to help menu
        tools/vm/slabinfo: add option to sort by partial slabs
        tools/vm/slabinfo: add partial slab listing to -X
        tools/vm/slabinfo: order command line options
        mm: vmscan: correct some vmscan counters for THP swapout
        mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
        ...
    • mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process() · 2c207985
      Tetsuo Handa authored
      Since commit bbbe4802 ("mm, oom: remove 'prefer children over
      parent' heuristic") removed the
      
        "%s: Kill process %d (%s) score %u or sacrifice child\n"
      
      line, oc->chosen_points is no longer used after select_bad_process().
      
      Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: decouple mems_allowed from oom_unkillable_task · ac311a14
      Shakeel Butt authored
      Commit ef08e3b4 ("[PATCH] cpusets: confine oom_killer to
      mem_exclusive cpuset") introduced a heuristic where a potential
      oom-killer victim is skipped if the intersection of the mems_allowed of
      the potential victim and of current (the process that triggered the oom)
      is empty, on the reasoning that killing such a victim most probably will
      not help the current allocating process.
      
      However the commit 7887a3da ("[PATCH] oom: cpuset hint") changed the
      heuristic to just decrease the oom_badness scores of such potential
      victim based on the reason that the cpuset of such processes might have
      changed and previously they may have allocated memory on mems where the
      current allocating process can allocate from.
      
      Unintentionally 7887a3da ("[PATCH] oom: cpuset hint") introduced a
      side effect as the oom_badness is also exposed to the user space through
      /proc/[pid]/oom_score, so, readers with different cpusets can read
      different oom_score of the same process.
      
      Later, commit 6cf86ac6 ("oom: filter tasks not sharing the same
      cpuset") fixed the side effect introduced by 7887a3da by moving the
      cpuset intersection back to only oom-killer context and out of
      oom_badness.  However the combination of ab290adb ("oom: make
      oom_unkillable_task() helper function") and 26ebc984 ("oom:
      /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
      brought back the cpuset intersection check into the oom_badness
      calculation function.
      
      Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
      oom context is also doing cpuset/mempolicy intersection which is quite
      wrong and is caught by syzkaller with the following report:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
        oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
        mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
        select_bad_process mm/oom_kill.c:374 [inline]
        out_of_memory mm/oom_kill.c:1088 [inline]
        out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
        mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
        mem_cgroup_oom mm/memcontrol.c:1905 [inline]
        try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
        mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
        mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
        do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
        do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
        wp_huge_pmd mm/memory.c:3793 [inline]
        __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
        handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
        do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
        __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
        do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
        page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
      RIP: 0033:0x400590
      Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
      8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
      00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
      RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
      RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
      RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
      R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
      Modules linked in:
      ---[ end trace a65689219582ffff ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      
      The fix is to decouple the cpuset/mempolicy intersection check from
      oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
      only done in the global oom context.
      
      [shakeelb@google.com: change function name and update comment]
        Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
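The fix described in this commit can be pictured with a small user-space model: the "never kill this task" test stays unconditional, while the cpuset/mempolicy intersection test runs only for a global OOM. This is a sketch under stated assumptions, not the kernel implementation. The struct layouts, the bitmask representation of memory nodes, and the names unkillable(), cpuset_eligible() and is_candidate() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task for this model. */
struct task {
    bool is_init_or_kthread;      /* never an OOM victim in this model */
    unsigned long mems_allowed;   /* bitmask of memory nodes the task may use */
};

/* Hypothetical OOM context: a non-NULL memcg means "memcg OOM". */
struct oom_ctx {
    const void *memcg;
    unsigned long nodemask;       /* nodes the allocating task may use */
};

/* Context-independent part: tasks that must never be selected. */
static bool unkillable(const struct task *t)
{
    return t->is_init_or_kthread;
}

/* cpuset-style test: does the task share any memory node with the allocator? */
static bool cpuset_eligible(const struct task *t, const struct oom_ctx *oc)
{
    return (t->mems_allowed & oc->nodemask) != 0;
}

/* The intersection test is applied only when this is a global OOM. */
static bool is_candidate(const struct task *t, const struct oom_ctx *oc)
{
    assert(t != NULL && oc != NULL);
    if (unkillable(t))
        return false;
    if (oc->memcg == NULL && !cpuset_eligible(t, oc))
        return false;
    return true;
}
```

In this model a task whose nodes do not overlap the allocator's is skipped during a global OOM but remains a candidate during a memcg OOM, which is the separation the commit message asks for.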
    • mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt authored
      oom_unkillable_task() can be called from three different contexts, i.e.
      global OOM, memcg OOM and the oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform the task_in_mem_cgroup()
      check for global OOM and the oom_score procfs interface, those contexts
      provide a NULL memcg and skip the task_in_mem_cgroup() check.  However,
      for the memcg OOM context, oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus the task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: refactor dump_tasks for memcg OOMs · 5eee7e1c
      Shakeel Butt authored
      dump_tasks() traverses all the existing processes even for the memcg OOM
      context which is not only unnecessary but also wasteful.  This imposes a
      long RCU critical section even from a contained context which can be quite
      disruptive.
      
      Change dump_tasks() to be aligned with select_bad_process and use
      mem_cgroup_scan_tasks to selectively traverse only processes of the target
      memcg hierarchy during memcg OOM.
      
      Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · f168a9a5
      Tetsuo Handa authored
      Since commit c03cd773 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: clarify error message · 135e5351
      Jane Chu authored
      Some users who install a SIGBUS handler that does a longjmp out, thereby
      keeping the process alive, are confused by the error message
      
        "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption"
      
      Slightly modify the error message to improve clarity.
      
      Link: http://lkml.kernel.org/r/1558403523-22079-1-git-send-email-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Pankaj Gupta <pagupta@redhat.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
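The situation this commit describes, a process that survives SIGBUS because its handler jumps out, can be sketched in user space as follows. This is only an illustration: the SIGBUS is raised by hand with raise() rather than by a real memory failure, and run_guarded(), fake_poisoned_access() and harmless() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Jump target used by the handler to get back into run_guarded(). */
static sigjmp_buf recover_point;

/* SIGBUS handler: do not return to the faulting code, jump out instead. */
static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);
}

/*
 * Run fn() with the SIGBUS handler installed.
 * Returns 0 if fn() completed, 1 if a SIGBUS arrived and the handler
 * jumped back here, -1 if the handler could not be installed.
 */
static int run_guarded(void (*fn)(void))
{
    struct sigaction sa, old;
    int hit;

    assert(fn != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    /* savemask=1 so the signal mask is restored after the jump. */
    if (sigsetjmp(recover_point, 1) == 0) {
        fn();
        hit = 0;
    } else {
        hit = 1;
    }

    sigaction(SIGBUS, &old, NULL);   /* restore the previous disposition */
    return hit;
}

/* Stand-in for touching a poisoned page: just raise SIGBUS by hand. */
static void fake_poisoned_access(void)
{
    raise(SIGBUS);
}

/* Control case: does nothing, so no signal is delivered. */
static void harmless(void)
{
}
```

With this pattern run_guarded(fake_poisoned_access) returns 1 and the process keeps running, which is why a kernel log line saying the process is being killed can read as misleading to such users.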
    • mm: vmalloc: show number of vmalloc pages in /proc/meminfo · 97105f0a
      Roman Gushchin authored
      Vmalloc() is getting more and more used these days (kernel stacks, bpf and
      percpu allocator are new top users), and the total % of memory consumed by
      vmalloc() can be pretty significant and changes dynamically.
      
      /proc/meminfo is the best place to display this information: its top goal
      is to show top consumers of the memory.
      
      Since the VmallocUsed field in /proc/meminfo has not been in use for quite
      a long time (it has been defined to 0 by a5ad88ce ("mm: get rid of
      'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the
      actual physical memory consumption of vmalloc().
      
      Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
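      As a rough illustration of how userspace could consume the reused
      field, the sketch below parses /proc/meminfo-style text and reports
      VmallocUsed as a fraction of MemTotal. The helper names
      (parse_meminfo, vmalloc_used_fraction) and the SAMPLE values are
      hypothetical and are not taken from the patch; only the field names
      come from /proc/meminfo.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def vmalloc_used_fraction(fields):
    """Return VmallocUsed / MemTotal, or 0.0 if MemTotal is missing or zero."""
    total = fields.get("MemTotal", 0)
    used = fields.get("VmallocUsed", 0)
    return used / total if total else 0.0


# Made-up sample values, for illustration only (not measurements).
SAMPLE = """\
MemTotal:       16000000 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       40000 kB
VmallocChunk:          0 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without procfs
    info = parse_meminfo(text)
    print("VmallocUsed: %d kB (%.4f%% of MemTotal)"
          % (info.get("VmallocUsed", 0), 100 * vmalloc_used_fraction(info)))
```

      On kernels without this change the field reads 0, so the reported
      fraction would simply be 0.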
    • Luigi Semenzato's avatar
      mm: smaps: split PSS into components · ee2ad71b
      Luigi Semenzato authored
      Report separate components (anon, file, and shmem) for PSS in
      smaps_rollup.
      
      This helps understand and tune the memory manager behavior in consumer
      devices, particularly mobile devices.  Many of them (e.g. Chromebooks and
      Android-based devices) use zram for anon memory, and perform disk reads
      for discarded file pages.  The difference in latency is large (e.g.
      reading a single page from SSD is 30 times slower than decompressing a
      zram page on one popular device), thus it is useful to know how much of
      the PSS is anon vs. file.
      
      All the information is already present in /proc/pid/smaps, but much more
      expensive to obtain because of the large size of that procfs entry.
      
      This patch also removes a small code duplication in smaps_account, which
      would have gotten worse otherwise.
      
      Also updated Documentation/filesystems/proc.txt (the smaps section was a
      bit stale, and I added a smaps_rollup section) and
      Documentation/ABI/testing/procfs-smaps_rollup.
      
      [semenzato@chromium.org: v5]
        Link: http://lkml.kernel.org/r/20190626234333.44608-1-semenzato@chromium.org
      Link: http://lkml.kernel.org/r/20190626180429.174569-1-semenzato@chromium.org
      Signed-off-by: Luigi Semenzato <semenzato@chromium.org>
      Acked-by: Yu Zhao <yuzhao@chromium.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Yu Zhao <yuzhao@chromium.org>
      Cc: Brian Geffon <bgeffon@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
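      A minimal sketch of reading the split PSS values from userspace,
      assuming the components appear in smaps_rollup as Pss_Anon, Pss_File
      and Pss_Shmem next to the existing Pss line. The helper name
      pss_components and the SAMPLE text are hypothetical, and the numbers
      are made up for illustration.

```python
def pss_components(text):
    """Extract Pss and its anon/file/shmem components (in kB) from smaps_rollup text."""
    wanted = ("Pss", "Pss_Anon", "Pss_File", "Pss_Shmem")
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # The address-range header line also contains ':' but its prefix
        # never matches a wanted field name, so it is skipped here.
        if sep and name in wanted and parts and parts[0].isdigit():
            out[name] = int(parts[0])
    return out


# Made-up sample in the smaps_rollup layout, for illustration only.
SAMPLE = """\
55d0a0000000-7ffd5a3f3000 ---p 00000000 00:00 0                          [rollup]
Rss:                5000 kB
Pss:                3000 kB
Pss_Anon:           1800 kB
Pss_File:           1000 kB
Pss_Shmem:           200 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/self/smaps_rollup") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample where the file is unavailable
    pss = pss_components(text)
    total = pss.get("Pss", 0)
    for key in ("Pss_Anon", "Pss_File", "Pss_Shmem"):
        if key in pss and total:
            print("%-10s %8d kB (%.1f%% of Pss)" % (key, pss[key], 100 * pss[key] / total))
```

      Reading the rollup file once avoids walking every mapping in
      /proc/pid/smaps, which is the cost the commit message refers to.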
    • Konstantin Khlebnikov's avatar
      mm: use down_read_killable for locking mmap_sem in access_remote_vm · 1e426fe2
      Konstantin Khlebnikov authored
      This function is used by ptrace and proc files like /proc/pid/cmdline and
      /proc/pid/environ.
      
      access_remote_vm() never returns error codes; all errors are ignored and
      only the size of the successfully read data is returned.  So, if the
      current task was killed, we'll simply return 0 (bytes read).
      
      mmap_sem could be locked for a long time, or forever, if something goes
      wrong.  Using a killable lock permits cleanup of stuck tasks and
      simplifies investigation.
      
      Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>