  1. 29 Dec, 2018 2 commits
    • Merge branch 'akpm' (patches from Andrew) · f346b0be
      Linus Torvalds authored
      Merge misc updates from Andrew Morton:
      
       - large KASAN update to use arm's "software tag-based mode"
      
       - a few misc things
      
       - sh updates
      
       - ocfs2 updates
      
       - just about all of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (167 commits)
        kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
        memcg, oom: notify on oom killer invocation from the charge path
        mm, swap: fix swapoff with KSM pages
        include/linux/gfp.h: fix typo
        mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
        hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
        hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
        memory_hotplug: add missing newlines to debugging output
        mm: remove __hugepage_set_anon_rmap()
        include/linux/vmstat.h: remove unused page state adjustment macro
        mm/page_alloc.c: allow error injection
        mm: migrate: drop unused argument of migrate_page_move_mapping()
        blkdev: avoid migration stalls for blkdev pages
        mm: migrate: provide buffer_migrate_page_norefs()
        mm: migrate: move migrate_page_lock_buffers()
        mm: migrate: lock buffers before migrate_page_move_mapping()
        mm: migration: factor out code to compute expected number of page references
        mm, page_alloc: enable pcpu_drain with zone capability
        kmemleak: add config to select auto scan
        mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
        ...
    • Merge tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 00d59fde
      Linus Torvalds authored
      Pull MMC updates from Ulf Hansson:
       "This time, this pull request contains changes crossing subsystems and
        archs/platforms, which is mainly because of a bigger modernization of
        moving from legacy GPIO to GPIO descriptors for MMC (by Linus
        Walleij).
      
        Additionally, once again, I am funneling changes to
        drivers/misc/cardreader/* and drivers/memstick/* through my MMC tree,
        mostly because we lack a maintainer for these.
      
        Summary:
      
        MMC core:
         - Cleanup BKOPS support
         - Introduce MMC_CAP_SYNC_RUNTIME_PM
         - slot-gpio: Delete legacy slot GPIO handling
      
        MMC host:
         - alcor: Add new mmc host driver for Alcor Micro PCI based cardreader
         - bcm2835: Several improvements to better recover from errors
         - jz4740: Rework and fixup pre|post_req support
         - mediatek: Add support for SDIO IRQs
         - meson-gx: Improve clock phase management
         - meson-gx: Stop descriptor on errors
         - mmci: Complete the sbc error path by sending a stop command
         - renesas_sdhi/tmio: Fixup reset/resume operations
         - renesas_sdhi: Add support for r8a774c0 and R7S9210
         - renesas_sdhi: Whitelist R8A77990 SDHI
         - renesas_sdhi: Fixup eMMC HS400 compatibility issues for H3 and M3-W
         - rtsx_usb_sdmmc: Re-work card detection/removal support
         - rtsx_usb_sdmmc: Re-work runtime PM support
         - sdhci: Fix timeout loops for some variant drivers
         - sdhci: Improve support for error handling due to failing commands
         - sdhci-acpi/pci: Disable LED control for Intel BYT-based controllers
         - sdhci_am654: Add new SDHCI variant driver to support TI's AM654 SOCs
         - sdhci-of-esdhc: Add support for eMMC HS400 mode
         - sdhci-omap: Fixup reset support
         - sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures
         - sdhci-msm: Fixup sporadic write transfers issues for SDR104/HS200
         - sdhci-msm: Fixup dynamical clock gating issues
         - various: Complete converting all hosts into using slot GPIO descriptors
      
        Other:
         - Move GPIO mmc platform data for mips/sh/arm to GPIO descriptors
         - Add new Alcor Micro cardreader PCI driver
         - Support runtime power management for memstick rtsx_usb_ms driver
         - Use USB remote wakeups for card detection for rtsx_usb misc driver"
      
      * tag 'mmc-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (99 commits)
        mmc: mediatek: Add MMC_CAP_SDIO_IRQ support
        mmc: renesas_sdhi_internal_dmac: Whitelist r8a774c0
        dt-bindings: mmc: renesas_sdhi: Add r8a774c0 support
        mmc: core: Cleanup BKOPS support
        mmc: core: Drop redundant check in mmc_send_hpi_cmd()
        mmc: sdhci-omap: Workaround errata regarding SDR104/HS200 tuning failures (i929)
        dt-bindings: sdhci-omap: Add note for cpu_thermal
        mmc: sdhci-acpi: Disable LED control for Intel BYT-based controllers
        mmc: sdhci-pci: Disable LED control for Intel BYT-based controllers
        mmc: sdhci: Add quirk to disable LED control
        mmc: mmci: add variant property to set command stop bit
        misc: alcor_pci: fix spelling mistake "invailid" -> "invalid"
        mmc: meson-gx: add signal resampling
        mmc: meson-gx: align default phase on soc vendor tree
        mmc: meson-gx: remove useless lock
        mmc: meson-gx: make sure the descriptor is stopped on errors
        mmc: sdhci_am654: Add Initial Support for AM654 SDHCI driver
        dt-bindings: mmc: sdhci-of-arasan: Add deprecated message for AM65
        dt-bindings: mmc: sdhci-am654: Document bindings for the host controllers on TI's AM654 SOCs
        mmc: sdhci-msm: avoid unused function warning
        ...
  2. 28 Dec, 2018 38 commits
    • Merge tag 'libnvdimm-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 75f95da0
      Linus Torvalds authored
      Pull libnvdimm updates from Dan Williams:
       "The vast bulk of this update is the new support for the security
        capabilities of some nvdimms.
      
        The userspace tooling for this capability is still a work in progress,
        but the changes survive the existing libnvdimm unit tests. The changes
        also pass manual checkout on hardware and the new nfit_test emulation
        of the security capability.
      
        The touches of the security/keys/ files have received the necessary
        acks from Mimi and David. Those changes were necessary to allow for a
        new generic encrypted-key type, and allow the nvdimm sub-system to
        lookup key material referenced by the libnvdimm-sysfs interface.
      
        Summary:
      
         - Add support for the security features of nvdimm devices that
           implement a security model similar to ATA hard drive security. The
           security model supports locking access to the media at
           device-power-loss, to be unlocked with a passphrase, and
           secure-erase (crypto-scramble).
      
           Unlike the ATA security case where the kernel expects device
           security to be managed in a pre-OS environment, the libnvdimm
           security implementation allows key provisioning and key-operations
           at OS runtime. Keys are managed with the kernel's encrypted-keys
           facility to provide data-at-rest security for the libnvdimm key
           material. The usage model mirrors fscrypt key management, but is
           driven via libnvdimm sysfs.
      
         - Miscellaneous updates for api usage and comment fixes"
      
      * tag 'libnvdimm-for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
        libnvdimm/security: Quiet security operations
        libnvdimm/security: Add documentation for nvdimm security support
        tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test
        tools/testing/nvdimm: Add overwrite support for nfit_test
        tools/testing/nvdimm: Add test support for Intel nvdimm security DSMs
        acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support
        acpi/nfit, libnvdimm/security: Add security DSM overwrite support
        acpi/nfit, libnvdimm: Add support for issue secure erase DSM to Intel nvdimm
        acpi/nfit, libnvdimm: Add enable/update passphrase support for Intel nvdimms
        acpi/nfit, libnvdimm: Add disable passphrase support to Intel nvdimm.
        acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs
        acpi/nfit, libnvdimm: Add freeze security support to Intel nvdimm
        acpi/nfit, libnvdimm: Introduce nvdimm_security_ops
        keys-encrypted: add nvdimm key format type to encrypted keys
        keys: Export lookup_user_key to external users
        acpi/nfit, libnvdimm: Store dimm id as a member to struct nvdimm
        libnvdimm, namespace: Replace kmemdup() with kstrndup()
        libnvdimm, label: Switch to bitmap_zalloc()
        ACPI/nfit: Adjust annotation for why return 0 if fail to find NFIT at start
        libnvdimm, bus: Check id immediately following ida_simple_get
        ...
    • Merge tag 'for-4.21/dm-changes' of... · 4ed7bdc1
      Linus Torvalds authored
      Merge tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - Eliminate a couple indirect calls from bio-based DM core.
      
       - Fix DM to allow reads that exceed readahead limits by setting
         io_pages in the backing_dev_info.
      
       - A couple code cleanups in request-based DM.
      
       - Fix various DM targets to check for device sector overflow if
         CONFIG_LBDAF is not set.
      
       - Use u64 instead of sector_t to store iv_offset in DM crypt; sector_t
         isn't large enough on 32bit when CONFIG_LBDAF is not set.
      
       - Performance fixes to DM's kcopyd and the snapshot target focused on
         limiting memory use and workqueue stalls.
      
       - Fix typos in the integrity and writecache targets.
      
       - Log which algorithm is used for dm-crypt's encryption and
         dm-integrity's hashing.
      
       - Fix false -EBUSY errors in DM raid target's handling of check/repair
         messages.
      
       - Fix DM flakey target's corrupt_bio_byte feature to reliably corrupt
         the Nth byte in a bio's payload.
      
      * tag 'for-4.21/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm: do not allow readahead to limit IO size
        dm raid: fix false -EBUSY when handling check/repair message
        dm rq: cleanup leftover code from recently removed q->mq_ops branching
        dm verity: log the hash algorithm implementation
        dm crypt: log the encryption algorithm implementation
        dm integrity: fix spelling mistake in workqueue name
        dm flakey: Properly corrupt multi-page bios.
        dm: Check for device sector overflow if CONFIG_LBDAF is not set
        dm crypt: use u64 instead of sector_t to store iv_offset
        dm kcopyd: Fix bug causing workqueue stalls
        dm snapshot: Fix excessive memory usage and workqueue stalls
        dm bufio: update comment in dm-bufio.c
        dm writecache: fix typo in error msg for creating writecache_flush_thread
        dm: remove indirect calls from __send_changing_extent_only()
        dm mpath: only flush workqueue when needed
        dm rq: remove unused arguments from rq_completed()
        dm: avoid indirect call in __dm_make_request
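The two CONFIG_LBDAF items in the device mapper summary above both come down to 32-bit wraparound. The following is a minimal Python sketch, assuming a 32-bit `sector_t` (as the summary describes for 32-bit builds without CONFIG_LBDAF). The helper names `as_sector_t` and `fits_in_sector_t` are hypothetical and do not correspond to functions in the DM code.

```python
SECTOR_T_BITS = 32  # assumption: sector_t is 32 bits wide without CONFIG_LBDAF
SECTOR_T_MAX = (1 << SECTOR_T_BITS) - 1
U64_MAX = (1 << 64) - 1


def as_sector_t(value):
    """Truncate a value the way a store into a 32-bit sector_t would."""
    return value & SECTOR_T_MAX


def fits_in_sector_t(value):
    """Hypothetical overflow check: reject values a 32-bit sector_t cannot hold."""
    return 0 <= value <= SECTOR_T_MAX


iv_offset = 5_000_000_000            # example offset that needs more than 32 bits
truncated = as_sector_t(iv_offset)   # what a 32-bit sector_t would keep
kept = iv_offset & U64_MAX           # a u64 keeps the full value

print(truncated, kept, fits_in_sector_t(iv_offset))
```

Storing the offset in a 64-bit type avoids the silent truncation, while a check such as `fits_in_sector_t` lets a target reject a table whose sector values would not fit.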
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 5d24ae67
      Linus Torvalds authored
      Pull rdma updates from Jason Gunthorpe:
       "This has been a fairly typical cycle, with the usual sorts of driver
        updates. Several series continue to come through which improve and
        modernize various parts of the core code, and we finally are starting
        to get the uAPI command interface cleaned up.
      
         - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4,
           mlx5, qib, rxe, usnic
      
         - Rework the entire syscall flow for uverbs to be able to run over
           ioctl(). Finally getting past the historic bad choice to use
           write() for command execution
      
         - More functional coverage with the mlx5 'devx' user API
      
         - Start of the HFI1 series for 'TID RDMA'
      
         - SRQ support in the hns driver
      
         - Support for new IBTA defined 2x lane widths
      
         - A big series to consolidate all the driver function pointers into a
           big struct and have drivers provide a 'static const' version of the
           struct instead of open coding initialization
      
         - New 'advise_mr' uAPI to control device caching/loading of page
           tables
      
         - Support for inline data in SRPT
      
         - Modernize how umad uses the driver core and creates cdev's and
           sysfs files
      
         - First steps toward removing 'uobject' from the view of the drivers"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits)
        RDMA/srpt: Use kmem_cache_free() instead of kfree()
        RDMA/mlx5: Signedness bug in UVERBS_HANDLER()
        IB/uverbs: Signedness bug in UVERBS_HANDLER()
        IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported
        IB/umad: Start using dev_groups of class
        IB/umad: Use class_groups and let core create class file
        IB/umad: Refactor code to use cdev_device_add()
        IB/umad: Avoid destroying device while it is accessed
        IB/umad: Simplify and avoid dynamic allocation of class
        IB/mlx5: Fix wrong error unwind
        IB/mlx4: Remove set but not used variable 'pd'
        RDMA/iwcm: Don't copy past the end of dev_name() string
        IB/mlx5: Fix long EEH recover time with NVMe offloads
        IB/mlx5: Simplify netdev unbinding
        IB/core: Move query port to ioctl
        RDMA/nldev: Expose port_cap_flags2
        IB/core: uverbs copy to struct or zero helper
        IB/rxe: Reuse code which sets port state
        IB/rxe: Make counters thread safe
        IB/mlx5: Use the correct commands for UMEM and UCTX allocation
        ...
    • Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 938edb8a
      Linus Torvalds authored
      Pull SCSI updates from James Bottomley:
       "This is mostly an update of the usual drivers: smartpqi, lpfc, qedi,
        megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas.
      
        Additionally, we have a pile of annotation, unused variable and minor
        updates.
      
        The big API change is the updates for Christoph's DMA rework which
        include removing the DISABLE_CLUSTERING flag.
      
        And finally there are a couple of target tree updates"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits)
        scsi: isci: request: mark expected switch fall-through
        scsi: isci: remote_node_context: mark expected switch fall-throughs
        scsi: isci: remote_device: Mark expected switch fall-throughs
        scsi: isci: phy: Mark expected switch fall-through
        scsi: iscsi: Capture iscsi debug messages using tracepoints
        scsi: myrb: Mark expected switch fall-throughs
        scsi: megaraid: fix out-of-bound array accesses
        scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through
        scsi: fcoe: remove set but not used variable 'port'
        scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown()
        scsi: smartpqi: fix build warnings
        scsi: smartpqi: update driver version
        scsi: smartpqi: add ofa support
        scsi: smartpqi: increase fw status register read timeout
        scsi: smartpqi: bump driver version
        scsi: smartpqi: add smp_utils support
        scsi: smartpqi: correct lun reset issues
        scsi: smartpqi: correct volume status
        scsi: smartpqi: do not offline disks for transient did no connect conditions
        scsi: smartpqi: allow for larger raid maps
        ...
    • Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping · af7ddd8a
      Linus Torvalds authored
      Pull DMA mapping updates from Christoph Hellwig:
       "A huge update this time, but a lot of that is just consolidating or
        removing code:
      
         - provide a common DMA_MAPPING_ERROR definition and avoid indirect
           calls for dma_map_* error checking
      
         - use direct calls for the DMA direct mapping case, avoiding huge
           retpoline overhead for high performance workloads
      
         - merge the swiotlb dma_map_ops into dma-direct
      
         - provide a generic remapping DMA consistent allocator for
           architectures that have devices that perform DMA that is not cache
           coherent. Based on the existing arm64 implementation and also used
           for csky now.
      
         - improve the dma-debug infrastructure, including dynamic allocation
           of entries (Robin Murphy)
      
         - default to providing chaining scatterlist everywhere, with opt-outs
           for the few architectures (alpha, parisc, most arm32 variants) that
           can't cope with it
      
         - misc sparc32 dma-related cleanups
      
         - remove the dma_mark_clean arch hook used by swiotlb on ia64 and
           replace it with the generic noncoherent infrastructure
      
         - fix the return type of dma_set_max_seg_size (Niklas Söderlund)
      
         - move the dummy dma ops for not DMA capable devices from arm64 to
           common code (Robin Murphy)
      
         - ensure dma_alloc_coherent returns zeroed memory to avoid kernel
           data leaks through userspace. We already did this for most common
           architectures, but this ensures we do it everywhere.
           dma_zalloc_coherent has been deprecated and can hopefully be
           removed after -rc1 with a coccinelle script"
      
      * tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping: (73 commits)
        dma-mapping: fix inverted logic in dma_supported
        dma-mapping: deprecate dma_zalloc_coherent
        dma-mapping: zero memory returned from dma_alloc_*
        sparc/iommu: fix ->map_sg return value
        sparc/io-unit: fix ->map_sg return value
        arm64: default to the direct mapping in get_arch_dma_ops
        PCI: Remove unused attr variable in pci_dma_configure
        ia64: only select ARCH_HAS_DMA_COHERENT_TO_PFN if swiotlb is enabled
        dma-mapping: bypass indirect calls for dma-direct
        vmd: use the proper dma_* APIs instead of direct methods calls
        dma-direct: merge swiotlb_dma_ops into the dma_direct code
        dma-direct: use dma_direct_map_page to implement dma_direct_map_sg
        dma-direct: improve addressability error reporting
        swiotlb: remove dma_mark_clean
        swiotlb: remove SWIOTLB_MAP_ERROR
        ACPI / scan: Refactor _CCA enforcement
        dma-mapping: factor out dummy DMA ops
        dma-mapping: always build the direct mapping code
        dma-mapping: move dma_cache_sync out of line
        dma-mapping: move various slow path functions out of line
        ...
    • Merge tag 'for-4.21/libata-20181221' of git://git.kernel.dk/linux-block · fe2b0cda
      Linus Torvalds authored
      Pull libata updates from Jens Axboe:
       "Here are the libata changes for this merge window. Nothing major in
        here. This contains:
      
         - GPIO descriptor conversions (Linus Walleij)
      
         - rcar deferred probing fix (Sergei Shtylyov)"
      
      * tag 'for-4.21/libata-20181221' of git://git.kernel.dk/linux-block:
        sata_rcar: fix deferred probing
        ata: palmld: Introduce state container
        ata: palmld: Convert to GPIO descriptors
        ata: rb532_cf: Convert to use GPIO descriptors
        ata: sata_highbank: Convert to use GPIO descriptors
        ata: pxa: Drop <linux/gpio.h> include
    • Merge tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block · 956eb6cb
      Linus Torvalds authored
      Pull aio updates from Jens Axboe:
       "Flushing out pre-patches for the buffered/polled aio series. Some
        fixes in here, but also optimizations"
      
      * tag 'for-4.21/aio-20181221' of git://git.kernel.dk/linux-block:
        aio: abstract out io_event filler helper
        aio: split out iocb copy from io_submit_one()
        aio: use iocb_put() instead of open coding it
        aio: only use blk plugs for > 2 depth submissions
        aio: don't zero entire aio_kiocb aio_get_req()
        aio: separate out ring reservation from req allocation
        aio: use assigned completion handler
    • Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block · 0e9da3fb
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
       "This is the main pull request for block/storage for 4.21.
      
        Larger than usual, it was a busy round with lots of goodies queued up.
        Most notable is the removal of the old IO stack, which has been a long
        time coming. No new features for a while, everything coming in this
        week has all been fixes for things that were previously merged.
      
        This contains:
      
         - Use atomic counters instead of semaphores for mtip32xx (Arnd)
      
         - Cleanup of the mtip32xx request setup (Christoph)
      
         - Fix for circular locking dependency in loop (Jan, Tetsuo)
      
         - bcache (Coly, Guoju, Shenghui)
            * Optimizations for writeback caching
            * Various fixes and improvements
      
         - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
            * host and target support for NVMe over TCP
            * Error log page support
            * Support for separate read/write/poll queues
            * Much improved polling
            * discard OOM fallback
            * Tracepoint improvements
      
         - lightnvm (Hans, Hua, Igor, Matias, Javier)
            * Igor added packed metadata to pblk. Now drives without metadata
              per LBA can be used as well.
            * Fix from Geert on uninitialized value on chunk metadata reads.
            * Fixes from Hans and Javier to pblk recovery and write path.
            * Fix from Hua Su to fix a race condition in the pblk recovery
              code.
            * Scan optimization added to pblk recovery from Zhoujie.
            * Small geometry cleanup from me.
      
         - Conversion of the last few drivers that used the legacy path to
           blk-mq (me)
      
         - Removal of legacy IO path in SCSI (me, Christoph)
      
         - Removal of legacy IO stack and schedulers (me)
      
         - Support for much better polling, now without interrupts at all.
           blk-mq adds support for multiple queue maps, which enables us to
           have a map per type. This in turn enables nvme to have separate
           completion queues for polling, which can then be interrupt-less.
           Also means we're ready for async polled IO, which is hopefully
           coming in the next release.
      
         - Killing of (now) unused block exports (Christoph)
      
         - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
      
         - Support for zoned testing with null_blk (Masato)
      
         - sx8 conversion to per-host tag sets (Christoph)
      
         - IO priority improvements (Damien)
      
         - mq-deadline zoned fix (Damien)
      
         - Ref count blkcg series (Dennis)
      
         - Lots of blk-mq improvements and speedups (me)
      
         - sbitmap scalability improvements (me)
      
         - Make core inflight IO accounting per-cpu (Mikulas)
      
         - Export timeout setting in sysfs (Weiping)
      
         - Cleanup the direct issue path (Jianchao)
      
         - Export blk-wbt internals in block debugfs for easier debugging
           (Ming)
      
         - Lots of other fixes and improvements"
      
      * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
        kyber: use sbitmap add_wait_queue/list_del wait helpers
        sbitmap: add helpers for add/del wait queue handling
        block: save irq state in blkg_lookup_create()
        dm: don't reuse bio for flushes
        nvme-pci: trace SQ status on completions
        nvme-rdma: implement polling queue map
        nvme-fabrics: allow user to pass in nr_poll_queues
        nvme-fabrics: allow nvmf_connect_io_queue to poll
        nvme-core: optionally poll sync commands
        block: make request_to_qc_t public
        nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
        nvme-tcp: fix endianess annotations
        nvmet-tcp: fix endianess annotations
        nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
        nvme-pci: only set nr_maps to 2 if poll queues are supported
        nvmet: use a macro for default error location
        nvmet: fix comparison of a u16 with -1
        blk-mq: enable IO poll if .nr_queues of type poll > 0
        blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
        blk-mq: skip zero-queue maps in blk_mq_map_swqueue
        ...
    • Merge tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground · b12a9124
      Linus Torvalds authored
      Pull y2038 updates from Arnd Bergmann:
       "More syscalls and cleanups
      
        This concludes the main part of the system call rework for 64-bit
        time_t, which has spread over most of year 2018, the last six system
        calls being
      
          - ppoll
          - pselect6
          - io_pgetevents
          - recvmmsg
          - futex
          - rt_sigtimedwait
      
        As before, nothing changes for 64-bit architectures, while 32-bit
        architectures gain another entry point that differs only in the layout
        of the timespec structure. Hopefully in the next release we can wire
        up all 22 of those system calls on all 32-bit architectures, which
        gives us a baseline version for glibc to start using them.
      
        This does not include the clock_adjtime, getrusage/waitid, and
        getitimer/setitimer system calls. I still plan to have new versions of
        those as well, but they are not required for correct operation of the
        C library since they can be emulated using the old 32-bit time_t based
        system calls.
      
        Aside from the system calls, there are also a few cleanups here,
        removing old kernel internal interfaces that have become unused after
        all references got removed. The arch/sh cleanups are part of this;
        they were posted several times over the past year without a reaction
        from the maintainers, while the corresponding changes made it into all
        other architectures"
      
      * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
        timekeeping: remove obsolete time accessors
        vfs: replace current_kernel_time64 with ktime equivalent
        timekeeping: remove timespec_add/timespec_del
        timekeeping: remove unused {read,update}_persistent_clock
        sh: remove board_time_init() callback
        sh: remove unused rtc_sh_get/set_time infrastructure
        sh: sh03: rtc: push down rtc class ops into driver
        sh: dreamcast: rtc: push down rtc class ops into driver
        y2038: signal: Add compat_sys_rt_sigtimedwait_time64
        y2038: signal: Add sys_rt_sigtimedwait_time32
        y2038: socket: Add compat_sys_recvmmsg_time64
        y2038: futex: Add support for __kernel_timespec
        y2038: futex: Move compat implementation into futex.c
        io_pgetevents: use __kernel_timespec
        pselect6: use __kernel_timespec
        ppoll: use __kernel_timespec
        signal: Add restore_user_sigmask()
        signal: Add set_user_sigmask()
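The layout difference between the old and new entry points described in the y2038 pull message can be illustrated with Python's `struct` module. This is a sketch, assuming a little-endian 32-bit ABI in which the legacy timespec has two 32-bit fields and `__kernel_timespec` has two 64-bit fields. The constant and function names are chosen for this example only.

```python
import struct

OLD_TIMESPEC = "<ii"  # assumed legacy 32-bit layout: 32-bit tv_sec, 32-bit tv_nsec
NEW_TIMESPEC = "<qq"  # assumed __kernel_timespec layout: 64-bit tv_sec, 64-bit tv_nsec

LAST_32BIT_SECOND = 2**31 - 1  # 2038-01-19 03:14:07 UTC


def pack_timespec(fmt, tv_sec, tv_nsec):
    """Return the packed bytes, or None if a field does not fit the layout."""
    try:
        return struct.pack(fmt, tv_sec, tv_nsec)
    except struct.error:
        return None


old_size = struct.calcsize(OLD_TIMESPEC)
new_size = struct.calcsize(NEW_TIMESPEC)

# One second past the 32-bit limit: only the 64-bit layout can represent it.
old_after_2038 = pack_timespec(OLD_TIMESPEC, LAST_32BIT_SECOND + 1, 0)
new_after_2038 = pack_timespec(NEW_TIMESPEC, LAST_32BIT_SECOND + 1, 0)

print(old_size, new_size, old_after_2038 is None, new_after_2038 is not None)
```

Because only the structure layout changes, a 32-bit architecture can expose both variants of a system call side by side, which is the approach the pull message describes.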
    • Fix failure path in alloc_pid() · 1a80dade
      Matthew Wilcox authored
      The failure path removes the allocated PIDs from the wrong namespace.
      This could lead to us inadvertently reusing PIDs in the leaf namespace
      and leaking PIDs in parent namespaces.
      
      Fixes: 95846ecf ("pid: replace pid bitmap implementation with IDR API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
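The effect described in the alloc_pid() commit message can be reproduced with a toy model. The following Python sketch is not taken from the kernel source. It assumes that a PID gets one number per namespace level, allocated from the leaf upward, and that the buggy cleanup frees every number in the leaf namespace, which is one plausible reading of "the wrong namespace". The `Namespace` class and all function names are hypothetical.

```python
class Namespace:
    """Toy PID namespace: a set of numbers in use and a capacity limit."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = set()

    def alloc(self):
        """Return the lowest free number, or None if the namespace is full."""
        for nr in range(1, self.capacity + 1):
            if nr not in self.used:
                self.used.add(nr)
                return nr
        return None

    def free(self, nr):
        self.used.discard(nr)


def alloc_pid(chain, buggy=False):
    """Allocate one number per namespace, walking from the leaf up to the root.

    `chain` is ordered root first, leaf last. Returns the (namespace, number)
    pairs on success, or None after undoing a partial allocation.
    """
    leaf = chain[-1]
    taken = []
    for ns in reversed(chain):
        nr = ns.alloc()
        if nr is None:
            for owner, owned_nr in taken:
                # Correct cleanup frees each number in the namespace that owns
                # it. The assumed buggy variant always frees in the leaf.
                target = leaf if buggy else owner
                target.free(owned_nr)
            return None
        taken.append((ns, nr))
    return taken


def demo(buggy):
    root = Namespace("root", capacity=0)  # full, so allocation fails here
    mid = Namespace("mid", capacity=4)
    leaf = Namespace("leaf", capacity=4)
    leaf.used.add(1)                      # number 1 belongs to an existing task
    alloc_pid([root, mid, leaf], buggy=buggy)
    return sorted(mid.used), sorted(leaf.used)


print(demo(buggy=False))
print(demo(buggy=True))
```

With the correct cleanup both namespaces return to their starting state. With the assumed buggy cleanup the intermediate namespace keeps a leaked number, and the leaf loses the number of a live task, which could then be handed out again.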
    • kernel/fork.c: mark 'stack_vm_area' with __maybe_unused · 0f4991e8
      YueHaibing authored
      Fixes gcc '-Wunused-but-set-variable' warning when CONFIG_VMAP_STACK is
      not set:
      
      kernel/fork.c: In function 'dup_task_struct':
      kernel/fork.c:843:20: warning:
       variable 'stack_vm_area' set but not used [-Wunused-but-set-variable]
      
      Link: http://lkml.kernel.org/r/1545965190-2381-1-git-send-email-yuehaibing@huawei.com
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, oom: notify on oom killer invocation from the charge path · 7056d3a3
      Michal Hocko authored
      Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
      eventfd anymore.  The reason is that 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") has moved the oom handling back to
      the charge path.  While doing so the notification was left behind in
      mem_cgroup_oom_synchronize.
      
      Fix the issue by replicating the oom hierarchy locking and the
      notification.
      
      Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
      Fixes: 29ef680a ("memcg, oom: move out_of_memory back to the charge path")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Burt Holzman <burt@fnal.gov>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, swap: fix swapoff with KSM pages · 7af7a8e1
      Huang Ying authored
      KSM pages may be mapped to the multiple VMAs that cannot be reached from
      one anon_vma.  So during swapin, a new copy of the page need to be
      generated if a different anon_vma is needed, please refer to comments of
      ksm_might_need_to_copy() for details.
      
      During swapoff, unuse_vma() uses anon_vma (if available) to locate the VMA
      and virtual address mapped to the page, so not all mappings of a swapped
      out KSM page can be found.  So in try_to_unuse(), even if the swap count of
      a swap entry isn't zero, the page needs to be deleted from swap cache, so
      that in the next round a new page can be allocated and swapped in for the
      other mappings of the swapped out KSM page.
      
      But this contradicts the THP swap support, where the THP can be
      deleted from swap cache only after the swap count of every swap entry in
      the huge swap cluster backing the THP has reached 0.  So try_to_unuse() was
      changed in commit e0709829 ("mm, THP, swap: support to reclaim swap
      space for THP swapped out") to check that before deleting a page from swap
      cache, but this broke KSM swapoff too.
      
      Fortunately, KSM is for the normal pages only, so the original behavior
      for KSM pages could be restored easily via checking PageTransCompound().
      That is how this patch works.
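      
      The resulting decision can be modelled in userspace as follows.  This is
      a simplified sketch, not the kernel code: the struct and its fields are
      hypothetical stand-ins for PageSwapCache(), PageTransCompound() and the
      check on the huge swap cluster.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state that matters to the decision. */
struct swap_page_model {
	bool in_swap_cache;		/* models PageSwapCache() */
	bool trans_compound;		/* models PageTransCompound() */
	bool huge_cluster_in_use;	/* some entry of the huge swap cluster
					 * still has a nonzero swap count */
};

/* Should try_to_unuse() drop this page from the swap cache? */
bool should_delete_from_swap_cache(const struct swap_page_model *page)
{
	assert(page);

	if (!page->in_swap_cache)
		return false;

	/* Normal page (possibly KSM): original behaviour, always delete. */
	if (!page->trans_compound)
		return true;

	/* THP: delete only once no entry in the cluster is in use. */
	return !page->huge_cluster_in_use;
}
```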
      
      The bug was introduced by e0709829 ("mm, THP, swap: support to reclaim
      swap space for THP swapped out"), which was merged in v4.14-rc1.  So I
      think we should backport the fix to 4.14 and later.  But Hugh thinks it may
      be rare for KSM pages to be in the swap device at swapoff, which is why
      nobody has reported the bug so far.
      
      Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
      Fixes: e0709829 ("mm, THP, swap: support to reclaim swap space for THP swapped out")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7af7a8e1
    • Kyle Spiers's avatar
    • Dan Williams's avatar
      mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm · 063a7d1d
      Dan Williams authored
      The kbuild robot reported the following on a development branch that used
      memremap.h in a new path:
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
                           from arch/m68k/include/asm/pgtable.h:5,
                           from include/linux/memremap.h:7,
                           from drivers//dax/bus.c:3:
          arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
       >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
            return mm->pgd + pgd_index(address);
                     ^~
      
      The ->page_fault() callback is specific to HMM.  Move it to 'struct
      hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in
      include/linux/hmm.h.  Longer term refactoring this dependency out of HMM
      is recommended, but in the meantime memremap.h remains generic.
      
      Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 5042db43 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      063a7d1d
    • Mike Kravetz's avatar
      hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race · c86aa7bb
      Mike Kravetz authored
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straightforward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken into
      account as well.
      
      Instead of writing the required complicated code for this rare occurrence,
      just eliminate the race.  i_mmap_rwsem is now held in read mode for the
      duration of page fault processing.  Hold i_mmap_rwsem longer in truncation
      and hole punch code to cover the call to remove_inode_hugepages.
      
      With this modification, code in remove_inode_hugepages checking for races
      becomes 'dead' as it can no longer happen.  Remove the dead code and
      expand comments to explain reasoning.  Similarly, checks for races with
      truncation in the page fault path can be simplified and removed.
      
      [mike.kravetz@oracle.com: incorporate suggestions from Kirill]
        Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
      Fixes: ebed4bfc ("hugetlb: fix absurd HugePages_Rsvd")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c86aa7bb
    • Mike Kravetz's avatar
      hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · b43a9990
      Mike Kravetz authored
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with the
        ptep.
      
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
        called.
      
      [mike.kravetz@oracle.com: add explicit check for mapping != null]
      Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b43a9990
    • Michal Hocko's avatar
      memory_hotplug: add missing newlines to debugging output · 1ecc07fd
      Michal Hocko authored
      pages_correctly_probed is missing newlines, which means that the line is
      not printed right away but rather waits for additional printks.
      
      Add \n to all three messages in pages_correctly_probed.
      
      Link: http://lkml.kernel.org/r/20181218162307.10518-1-mhocko@kernel.org
      Fixes: b77eab70 ("mm/memory_hotplug: optimize probe routine")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ecc07fd
    • Kirill Tkhai's avatar
      mm: remove __hugepage_set_anon_rmap() · 451b9514
      Kirill Tkhai authored
      This function has been identical to __page_set_anon_rmap() ever since
      it was introduced (8 years ago).  The patch removes the function and
      makes its users call __page_set_anon_rmap() instead.
      
      Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      451b9514
    • Wei Yang's avatar
    • Benjamin Poirier's avatar
      mm/page_alloc.c: allow error injection · af3b8544
      Benjamin Poirier authored
      Model call chain after should_failslab().  Likewise, we can now use a
      kprobe to override the return value of should_fail_alloc_page() and inject
      allocation failures into alloc_page*().
      
      This will allow injecting allocation failures using the BCC tools even
      without building kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a
      fail_page_alloc= parameter, which incurs some overhead even when failures
      are not being injected.  On the other hand, this patch adds an
      unconditional call to should_fail_alloc_page() from page allocation
      hotpath.  That overhead should be rather negligible with
      CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.
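      
      As a rough userspace analogy only (this is not how kprobes work
      internally), the idea of an always-present check whose return value can
      be overridden from outside might be sketched like this.  All names below
      are hypothetical; the function pointer merely stands in for an attached
      kprobe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook standing in for a kprobe overriding the return value. */
static bool (*fail_alloc_override)(size_t size);

void set_fail_alloc_override(bool (*hook)(size_t size))
{
	fail_alloc_override = hook;
}

/* Called unconditionally from the allocation path; cheap with no hook. */
bool should_fail_alloc(size_t size)
{
	return fail_alloc_override ? fail_alloc_override(size) : false;
}

/* Allocation path that honours an injected failure. */
void *model_alloc(size_t size)
{
	assert(size > 0);

	if (should_fail_alloc(size))
		return NULL;
	return malloc(size);
}

/* Example policy: fail every request larger than 4096 bytes. */
bool fail_large_allocations(size_t size)
{
	return size > 4096;
}
```

      With no hook installed, the only cost is the extra call and the pointer
      test, which mirrors the overhead trade-off discussed above.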
      
      [vbabka@suse.cz: changelog addition]
      Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af3b8544
    • Jan Kara's avatar
      mm: migrate: drop unused argument of migrate_page_move_mapping() · ab41ee68
      Jan Kara authored
      All callers of migrate_page_move_mapping() now pass NULL for 'head'
      argument.  Drop it.
      
      Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab41ee68
    • Jan Kara's avatar
      blkdev: avoid migration stalls for blkdev pages · 88dbcbb3
      Jan Kara authored
      Currently, block device pages don't provide a ->migratepage callback and
      thus fallback_migrate_page() is used for them.  This handler cannot deal
      with dirty pages in async mode, nor with the case where a buffer head is
      in the LRU buffer head cache (as it has elevated b_count).  Thus such a
      page can block memory offlining.
      
      Fix the problem by using buffer_migrate_page_norefs() for migrating block
      device pages.  That function takes care of dropping bh LRU in case
      migration would fail due to elevated buffer refcount to avoid stalls and
      can also migrate dirty pages without writing them.
      
      Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88dbcbb3
    • Jan Kara's avatar
      mm: migrate: provide buffer_migrate_page_norefs() · 89cb0888
      Jan Kara authored
      Provide a variant of buffer_migrate_page() that also checks whether there
      are no unexpected references to buffer heads.  This function will then be
      safe to use for block device pages.
      
      [akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)]
      Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89cb0888
    • Jan Kara's avatar
      mm: migrate: move migrate_page_lock_buffers() · 84ade7c1
      Jan Kara authored
      buffer_migrate_page() is the only caller of migrate_page_lock_buffers(),
      so move it close to it and also drop the now unused stub for !CONFIG_BLOCK.
      
      Link: http://lkml.kernel.org/r/20181211172143.7358-4-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84ade7c1
    • Jan Kara's avatar
      mm: migrate: lock buffers before migrate_page_move_mapping() · cc4f11e6
      Jan Kara authored
      Lock buffers before calling into migrate_page_move_mapping() so that that
      function doesn't have to know about buffers (which is somewhat unexpected
      anyway) and all the buffer head logic is in buffer_migrate_page().
      
      Link: http://lkml.kernel.org/r/20181211172143.7358-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc4f11e6
    • Jan Kara's avatar
      mm: migration: factor out code to compute expected number of page references · 0b3901b3
      Jan Kara authored
      Patch series "mm: migrate: Fix page migration stalls for blkdev pages".
      
      This patchset deals with page migration stalls that were reported by our
      customer due to a block device page that had a bufferhead that was in the
      bh LRU cache.
      
      The patchset modifies the page migration code so that bufferheads are
      completely handled inside buffer_migrate_page() and then provides a new
      migration helper for pages with buffer heads that is safe to use even for
      block device pages and that also deals with bh lrus.
      
      This patch (of 6):
      
      Factor out function to compute number of expected page references in
      migrate_page_move_mapping().  Note that we move hpage_nr_pages() and
      page_has_private() checks from under xas_lock_irq() however this is safe
      since we hold page lock.
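      
      A userspace model of such a helper is sketched below.  The struct fields
      are hypothetical stand-ins for the kernel predicates (page_mapping(),
      page_has_private(), hpage_nr_pages() and the device-page checks), and the
      exact terms are an assumption rather than a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page properties that affect the count. */
struct mig_page_model {
	bool device_private;	/* models is_device_private_page() */
	bool device_public;	/* models is_device_public_page() */
	bool has_mapping;	/* models page_mapping() != NULL */
	bool has_private;	/* models page_has_private() */
	int nr_pages;		/* models hpage_nr_pages() */
};

/* Number of references migration expects to find on the page. */
int expected_page_refs_model(const struct mig_page_model *page)
{
	int expected = 1;	/* reference held by the migration path */

	assert(page && page->nr_pages >= 1);

	expected += page->device_private;
	expected += page->device_public;
	if (page->has_mapping)
		expected += page->nr_pages + page->has_private;

	return expected;
}
```

      Migration can then compare this value against the actual reference
      count and back off when they differ.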
      
      [jack@suse.cz: fix expected_page_refs()]
        Link: http://lkml.kernel.org/r/20181217131710.GB8611@quack2.suse.cz
      Link: http://lkml.kernel.org/r/20181211172143.7358-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b3901b3
    • Wei Yang's avatar
      mm, page_alloc: enable pcpu_drain with zone capability · d9367bd0
      Wei Yang authored
      drain_all_pages is documented to drain per-cpu pages for a given zone (if
      non-NULL).  The current implementation doesn't match the description
      though.  It will drain all pcp pages for all zones that happen to have
      cached pages on the same cpu as the given zone.  This will lead to
      premature pcp cache draining for zones that are not of any interest to the
      caller - e.g.  compaction, hwpoison or memory offline.
      
      This forces the page allocator to take locks and can cause lock
      contention as a result.
      
      There is no real reason for this sub-optimal implementation.  Replace
      per-cpu work item with a dedicated structure which contains a pointer to
      the zone and pass it over to the worker.  This will get the zone
      information all the way down to the worker function and do the right job.
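      
      The idea of handing the zone to the per-cpu worker can be illustrated
      with a small userspace model.  Everything here is hypothetical (array
      sizes, names, and the use of a zone index instead of a zone pointer); it
      only shows that a work descriptor carrying the zone lets the worker
      leave other zones' cached pages alone.

```c
#include <assert.h>

#define NR_CPUS_MODEL	2
#define NR_ZONES_MODEL	3
#define ALL_ZONES	(-1)

/* Pages cached per cpu and per zone (models the pcp lists). */
static int pcp_pages[NR_CPUS_MODEL][NR_ZONES_MODEL];

/* Work descriptor handed to the worker; carries the zone to drain. */
struct pcpu_drain_model {
	int cpu;
	int zone;	/* zone index, or ALL_ZONES */
};

void pcp_cache_pages(int cpu, int zone, int count)
{
	pcp_pages[cpu][zone] += count;
}

int pcp_cached_pages(int cpu, int zone)
{
	return pcp_pages[cpu][zone];
}

/* Worker: drain only the requested zone on this cpu; returns pages freed. */
int drain_local_pages_model(const struct pcpu_drain_model *drain)
{
	int zone, freed = 0;

	assert(drain && drain->cpu >= 0 && drain->cpu < NR_CPUS_MODEL);

	for (zone = 0; zone < NR_ZONES_MODEL; zone++) {
		if (drain->zone != ALL_ZONES && drain->zone != zone)
			continue;
		freed += pcp_pages[drain->cpu][zone];
		pcp_pages[drain->cpu][zone] = 0;
	}
	return freed;
}
```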
      
      [akpm@linux-foundation.org: avoid 80-col tricks]
      [mhocko@suse.com: refactor the whole changelog]
      Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9367bd0
    • Sri Krishna chowdary's avatar
      kmemleak: add config to select auto scan · d53ce042
      Sri Krishna chowdary authored
      Kmemleak scan can be cpu intensive and can stall user tasks at times.  To
      prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable auto
      scan on boot up.  Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN as
      this is meant for only first automatic scan.
      
      Link: http://lkml.kernel.org/r/1540231723-7087-1-git-send-email-prpatel@nvidia.com
      Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com>
      Signed-off-by: Sachin Nikam <snikam@nvidia.com>
      Signed-off-by: Prateek <prpatel@nvidia.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d53ce042
    • Waiman Long's avatar
      mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · 3c0c12cc
      Waiman Long authored
      When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
      pages initialization can take a long time.  Below were the reported init
      times on an 8-socket 96-core 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X of the non-debug kernel.
      The long init time can be problematic as the page initialization is done
      with interrupts disabled.  In this particular case, it caused the
      appearance of the following warning messages as well as NMI backtraces of all
      the cores that were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      pages initialization is done.
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
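      
      The figures quoted in this message can be cross-checked with a few lines
      of integer arithmetic.  The helpers below are hypothetical, and the
      4 KiB page size is an assumption since the log does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio a/b scaled by 100 and rounded down (266 means 2.66x). */
uint64_t ratio_x100(uint64_t a, uint64_t b)
{
	assert(b != 0);
	return a * 100 / b;
}

/* Size of a page count in whole GiB, assuming 4 KiB pages. */
uint64_t pages_to_gib(uint64_t pages)
{
	return (pages * 4096) >> 30;
}

/*
 * ratio_x100(143052, 7027) is 2035: KASAN init took about 20x as long.
 * ratio_x100(18702, 7027)  is 266:  after the fix, about 2.7x.
 * pages_to_gib(132075466)  is 503:  the "almost 500GB" on node 1.
 */
```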
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c0c12cc
    • Peter Xu's avatar
      userfaultfd: clear flag if remap event not enabled · 3cfd22be
      Peter Xu authored
      When the process being tracked does mremap() without
      UFFD_FEATURE_EVENT_REMAP on the corresponding tracking uffd file handle,
      we should not generate the remap event, and at the same time we should
      clear all the uffd flags on the new VMA.  Without this patch, we can still
      have the VM_UFFD_MISSING|VM_UFFD_WP flags on the new VMA even though the
      fault handling process does not even know of the existence of the VMA.
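      
      The intended behaviour can be modelled in userspace roughly as follows.
      The types, the function name and the bit values are illustrative
      assumptions; the real kernel constants and structures differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; not the real kernel constants. */
#define VM_UFFD_MISSING_MODEL		0x1u
#define VM_UFFD_WP_MODEL		0x2u
#define UFFD_FEATURE_EVENT_REMAP_MODEL	0x1u

struct uffd_ctx_model {
	unsigned int features;
};

struct vma_model {
	unsigned int flags;
	struct uffd_ctx_model *ctx;	/* NULL if the VMA is not tracked */
};

/*
 * Prepare the new VMA created by mremap().  Returns true if a remap
 * event should be generated; otherwise the VMA is detached from uffd.
 */
bool mremap_uffd_prep_model(struct vma_model *new_vma)
{
	assert(new_vma);

	if (new_vma->ctx &&
	    (new_vma->ctx->features & UFFD_FEATURE_EVENT_REMAP_MODEL))
		return true;

	/* Remap events not enabled: drop the context and the uffd flags. */
	new_vma->ctx = NULL;
	new_vma->flags &= ~(VM_UFFD_MISSING_MODEL | VM_UFFD_WP_MODEL);
	return false;
}
```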
      
      Link: http://lkml.kernel.org/r/20181211053409.20317-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cfd22be
    • Pingfan Liu's avatar
      mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES · 125b860b
      Pingfan Liu authored
      Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code.
      If someone adds an extra migrate type, they may forget to enlarge
      NR_PAGEBLOCK_BITS.  Hence some way of catching this is required.
      
      NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros are spread
      across two different .h files with a reverse dependency, so it is a little
      hard to refer to MIGRATE_TYPES in pageblock-flags.h.  This patch flags
      the relation at compile time instead.
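      
      In plain C11 the same kind of guard can be expressed with
      _Static_assert, as in the sketch below.  The enum, the bit-width macro
      and the helper are hypothetical illustrations, not the kernel
      definitions.

```c
#include <assert.h>

/* Hypothetical list of migrate types; MIGRATE_TYPES_MODEL is the count. */
enum migratetype_model {
	MIGRATE_UNMOVABLE_MODEL,
	MIGRATE_MOVABLE_MODEL,
	MIGRATE_RECLAIMABLE_MODEL,
	MIGRATE_HIGHATOMIC_MODEL,
	MIGRATE_TYPES_MODEL
};

/* Hypothetical number of pageblock bits reserved for the migrate type. */
#define PB_MIGRATETYPE_BITS_MODEL 3

/* Break the build if a new type no longer fits into the reserved bits. */
_Static_assert(MIGRATE_TYPES_MODEL <= (1 << PB_MIGRATETYPE_BITS_MODEL),
	       "pageblock bits cannot represent every migrate type");

/* Runtime counterpart: can this value be stored in the reserved bits? */
int migratetype_fits(int type)
{
	return type >= 0 && type < (1 << PB_MIGRATETYPE_BITS_MODEL);
}
```

      Adding a ninth enumerator above would fail compilation instead of
      silently truncating the stored value.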
      
      Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      125b860b
    • Kirill Tkhai's avatar
      ksm: react on changing "sleep_millisecs" parameter faster · fcf9a0ef
      Kirill Tkhai authored
      ksm thread unconditionally sleeps in ksm_scan_thread() after each
      iteration:
      
      	schedule_timeout_interruptible(
      		msecs_to_jiffies(ksm_thread_sleep_millisecs))
      
      The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.
      
      If the user writes a big value by mistake and the thread enters
      schedule_timeout_interruptible(), it's not possible to cancel the
      sleep by writing a new smaller value; the thread just sleeps until the
      timeout expires.
      
      The patch fixes the problem by waking the thread each time after the value
      is updated.
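      
      A userspace analogy of "wake the sleeper whenever the tunable changes"
      is sketched below using eventfd() and poll().  This is not the kernel
      mechanism, and every name in it is hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

struct tunable_sleeper {
	int wake_fd;		/* eventfd used to interrupt the sleep */
	unsigned int sleep_ms;	/* models sleep_millisecs */
};

int sleeper_init(struct tunable_sleeper *s, unsigned int sleep_ms)
{
	assert(s);
	s->sleep_ms = sleep_ms;
	s->wake_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
	return s->wake_fd < 0 ? -1 : 0;
}

/* Models the sysfs store handler: update the value, then wake the sleeper. */
int sleeper_store(struct tunable_sleeper *s, unsigned int sleep_ms)
{
	uint64_t one = 1;

	s->sleep_ms = sleep_ms;
	return write(s->wake_fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Sleep for up to sleep_ms.  Returns 1 if woken early by a store,
 * 0 if the timeout expired and -1 on error.
 */
int sleeper_wait(struct tunable_sleeper *s)
{
	struct pollfd pfd = { .fd = s->wake_fd, .events = POLLIN };
	uint64_t pending;
	int ret = poll(&pfd, 1, (int)s->sleep_ms);

	if (ret <= 0)
		return ret;

	/* Drain the counter so that the next wait blocks again. */
	if (read(s->wake_fd, &pending, sizeof(pending)) != (ssize_t)sizeof(pending))
		return -1;
	return 1;
}
```

      A sleeper that was started with a mistakenly large timeout returns from
      sleeper_wait() as soon as sleeper_store() is called, and its next wait
      uses the new value.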
      
      This may also be useful for debugging purposes, and for userspace
      daemons that change the sleep_millisecs value depending on system load.
      
      Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcf9a0ef
    • Michal Hocko's avatar
      mm, fault_around: do not take a reference to a locked page · e0975b2a
      Michal Hocko authored
      filemap_map_pages takes a speculative reference to each page in the range
      before it tries to lock that page.  While this is correct it also can
      influence page migration which will bail out when seeing an elevated
      reference count.  The faultaround code would bail on seeing a locked page
      so we can proactively check the PageLocked bit before
      page_cache_get_speculative and avoid pointless reference count
      churn.
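      
      A single-threaded userspace model of the reordered checks is shown
      below.  The struct and function are hypothetical, and the second lock
      test only stands in for a trylock that loses a race; it cannot trigger
      in this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page cache page. */
struct cached_page_model {
	bool locked;
	int refcount;
};

/* Try to map one page during fault-around; returns true if mapped. */
bool fault_around_map_page(struct cached_page_model *page)
{
	assert(page);

	/* New early check: skip locked pages before touching the refcount. */
	if (page->locked)
		return false;

	page->refcount++;	/* models page_cache_get_speculative() */

	if (page->locked) {	/* models trylock_page() failing after a race */
		page->refcount--;
		return false;
	}

	page->locked = true;	/* page lock held while the mapping is set up */
	/* ... the real code would install the page table entry here ... */
	page->locked = false;

	return true;		/* the new mapping keeps the reference */
}
```

      For a page that is already locked, the reference count is never raised,
      so a concurrent migration would not see a transient extra reference.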
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0975b2a
    • Michal Hocko's avatar
      mm, memory_hotplug: deobfuscate migration part of offlining · bb8965bd
      Michal Hocko authored
      Memory migration might fail during offlining and we keep retrying in that
      case.  This is currently obfuscated by a goto retry loop.  The code is hard
      to follow and as a result it is even suboptimal because each retry round
      scans the full range from start_pfn even though we have successfully
      scanned/migrated [start_pfn, pfn] range already.  This is all only because
      check_pages_isolated failure has to rescan the full range again.
      
      De-obfuscate the migration retry loop by promoting it to a real for loop.
      In fact remove the goto altogether by making it a proper double loop
      (yeah, gotos are nasty in this specific case).  In the end we will get
      slightly more optimal code which is more readable.
      
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8965bd
    • Michal Hocko's avatar
      mm, memory_hotplug: try to migrate full pfn range · a85009c3
      Michal Hocko authored
      Patch series "few memory offlining enhancements".
      
      I have been chasing memory offlining not making progress recently.  On the
      way I have noticed few weird decisions in the code.  The migration itself
      is restricted without a reasonable justification and the retry loop around
      the migration is quite messy.  This is addressed by patch 1 and patch 2.
      
      Patch 3 targets the faultaround code which has been a hot
      candidate for the initial issue reported upstream [2] and that I am
      debugging internally.  It turned out to be not the main contributor in the
      end but I believe we should address it regardless.  See the patch
      description for more details.
      
      [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
      
      This patch (of 3):
      
      do_migrate_range has been limiting the number of pages to migrate to 256
      for some reason which is not documented.  Even if the limit made some
      sense back then when it was introduced it doesn't really serve a good
      purpose these days.  If the range contains huge pages then we break out of
      the loop too early, and going through LRU and pcp cache draining and
      scan_movable_pages again is quite suboptimal.
      
      The only reason to limit the number of pages I can think of is to reduce
      the potential time to react to a fatal signal.  But even then the number
      of pages is a questionable metric because even a single page migration
      might block in a non-killable state (e.g.  __unmap_and_move).
      
      Remove the limit and offline the full requested range (this is one
      memblock worth of pages with the current code).  Should we ever get a
      report that offlining takes too long to react to a fatal signal then we
      should rather fix the core migration to use killable waits and bail out
      on a signal.
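
      As a rough illustration of what a per-call limit costs, the toy
      calculation below counts how many migrate rounds (each one followed by
      cache draining and a rescan) are needed to cover a range.  It assumes
      that every page migrates on the first attempt and that no huge pages are
      involved.  The function name scan_rounds and the example size (a 128 MiB
      block of 4 KiB pages) are assumptions made here, not values taken from
      the patch.

```python
import math


def scan_rounds(nr_pages, batch_limit=None):
    """Rounds needed to cover nr_pages (toy model, not kernel code).

    batch_limit=None models migrating the full requested range in one go.
    """
    if nr_pages <= 0:
        return 0
    if batch_limit is None:
        return 1
    return math.ceil(nr_pages / batch_limit)


# Assumed example: 128 MiB of 4 KiB pages.
pages = (128 * 1024 * 1024) // 4096
rounds_with_limit = scan_rounds(pages, batch_limit=256)
rounds_without_limit = scan_rounds(pages)
```

      Under these assumptions the limited variant needs one round per 256
      pages, while the unlimited variant covers the range in a single round.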
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85009c3
    • Michal Hocko's avatar
      mm, proc: report PR_SET_THP_DISABLE in proc · a1400af7
      Michal Hocko authored
      David Rientjes has reported that commit 18600332 ("mm: make
      PR_SET_THP_DISABLE immediately active") has changed how we report
      THPable VMAs to userspace.  Their monitoring tool is triggering false
      alarms on PR_SET_THP_DISABLE tasks because it treats insufficient THP
      usage as a sign of memory fragmentation or memory pressure.
      
      Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
      flag and that got exposed to the userspace via /proc/<pid>/smaps file.
      This implementation had its downsides as explained in the commit message
      but it is true that the userspace doesn't have any means to query for
      the process wide THP enabled/disabled status.
      
      PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to
      export it in the process wide context rather than per-vma.  Introduce a
      new field to /proc/<pid>/status which exports this status.  If
      PR_SET_THP_DISABLE is used then it reports false, the same as when THP
      is not compiled in.  It doesn't consider the global THP status because we
      already export that information via sysfs.
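
      A monitoring tool could read the new field along the following lines.
      The field name THP_enabled is an assumption based on mainline kernels;
      the commit message does not spell it out.  The helper name
      thp_enabled_for_process is hypothetical, and the sample text is made up
      for illustration.

```python
def thp_enabled_for_process(status_text):
    """Return True/False from a /proc/<pid>/status dump, or None if absent.

    Assumes the field is called "THP_enabled" and holds 0 or 1.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "THP_enabled":
            return value.strip() == "1"
    return None  # field not present, e.g. on a kernel without this patch


# Made-up sample resembling a few lines of /proc/<pid>/status.
sample_status = "Name:\tcat\nThreads:\t1\nTHP_enabled:\t0\n"
sample_result = thp_enabled_for_process(sample_status)
```

      On a live system the same helper can be fed the contents of
      /proc/self/status.  Returning None for a missing field keeps "disabled"
      and "unknown" apart, so a tool does not raise an alarm on older kernels.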
      
      Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org
      Fixes: 18600332 ("mm: make PR_SET_THP_DISABLE immediately active")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1400af7
    • Michal Hocko's avatar
      mm, thp, proc: report THP eligibility for each vma · 7635d9cb
      Michal Hocko authored
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for the hg or nh flags and
      compare them with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anonymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so move the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just uses the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
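
      From userspace, the per-vma report could be consumed roughly as
      follows.  The field name THPeligible is an assumption based on mainline
      kernels; the commit message does not name it.  The helper name
      thp_eligible_vmas is hypothetical, and the sample smaps text is made up
      for illustration.

```python
import re

# A VMA header line starts with "<start>-<end> " in hexadecimal.
_VMA_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s")


def thp_eligible_vmas(smaps_text):
    """Map (start, end) of each VMA to its THP eligibility.

    Assumes each VMA block carries a "THPeligible: 0|1" line; VMAs without
    that line are left out of the result.
    """
    result = {}
    current = None
    for line in smaps_text.splitlines():
        header = _VMA_HEADER.match(line)
        if header:
            current = (int(header.group(1), 16), int(header.group(2), 16))
            continue
        key, sep, value = line.partition(":")
        if sep and current is not None and key.strip() == "THPeligible":
            result[current] = value.strip() == "1"
    return result


# Made-up sample resembling two VMA blocks of /proc/<pid>/smaps.
sample_smaps = (
    "00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example\n"
    "Size:                328 kB\n"
    "THPeligible:    0\n"
    "VmFlags: rd ex mr mw me\n"
    "7f0000000000-7f0000200000 rw-p 00000000 00:00 0 \n"
    "Size:               2048 kB\n"
    "THPeligible:    1\n"
    "VmFlags: rd wr mr mw me ac\n"
)
sample_vmas = thp_eligible_vmas(sample_smaps)
```

      A tool can then compare the eligible ranges against their actual huge
      page usage instead of reimplementing the flag, prctl and mount option
      logic itself.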
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7635d9cb