1. 01 Feb, 2018 40 commits
    • Linus Torvalds's avatar
      Merge tag 'driver-core-4.16-rc1' of... · 47fcc036
      Linus Torvalds authored
      Merge tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core updates from Greg KH:
       "Here is the set of "big" driver core patches for 4.16-rc1.
      
        The majority of the work here is in the firmware subsystem, with
        reworks to try to attempt to make the code easier to handle in the
        long run, but no functional change. There's also some tree-wide sysfs
        attribute fixups with lots of acks from the various subsystem
        maintainers, as well as a handful of other normal fixes and changes.
      
        And finally, some license cleanups for the driver core and sysfs code.
      
        All have been in linux-next for a while with no reported issues"
      
      * tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (48 commits)
        device property: Define type of PROPERTY_ENRTY_*() macros
        device property: Reuse property_entry_free_data()
        device property: Move property_entry_free_data() upper
        firmware: Fix up docs referring to FIRMWARE_IN_KERNEL
        firmware: Drop FIRMWARE_IN_KERNEL Kconfig option
        USB: serial: keyspan: Drop firmware Kconfig options
        sysfs: remove DEBUG defines
        sysfs: use SPDX identifiers
        drivers: base: add coredump driver ops
        sysfs: add attribute specification for /sysfs/devices/.../coredump
        test_firmware: fix missing unlock on error in config_num_requests_store()
        test_firmware: make local symbol test_fw_config static
        sysfs: turn WARN() into pr_warn()
        firmware: Fix a typo in fallback-mechanisms.rst
        treewide: Use DEVICE_ATTR_WO
        treewide: Use DEVICE_ATTR_RO
        treewide: Use DEVICE_ATTR_RW
        sysfs.h: Use octal permissions
        component: add debugfs support
        bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate
        ...
      47fcc036
    • Linus Torvalds's avatar
      Merge tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 5d8515bc
      Linus Torvalds authored
      Pull staging/IIO updates from Greg KH:
       "Here is the big Staging and IIO driver patches for 4.16-rc1.
      
        There is the normal amount of new IIO drivers added, like all
        releases.
      
        The networking IPX and the ncpfs filesystem are moved into the staging
        tree, as they are on their way out of the kernel due to lack of use
        anymore.
      
        The visorbus subsystem finall has started moving out of the staging
        tree to the "real" part of the kernel, and the most and fsl-mc
        codebases are almost ready to move out, that will probably happen for
        4.17-rc1 if all goes well.
      
        Other than that, there is a bunch of license header cleanups in the
        tree, along with the normal amount of coding style churn that we all
        know and love for this codebase. I also got frustrated at the
        Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
        huge chunks of it that were never even being used.
      
        Full details of everything is in the shortlog.
      
        All of these patches have been in linux-next for a while with no
        reported issues"
      
      * tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
        staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
        staging: rtl8723bs: remove a couple of redundant initializations
        staging: comedi: reformat lines to 80 chars or less
        staging: lustre: separate a connection destroy from free struct kib_conn
        Staging: rtl8723bs: Use !x instead of NULL comparison
        Staging: rtl8723bs: Remove dead code
        Staging: rtl8723bs: Change names to conform to the kernel code
        staging: ccree: Fix missing blank line after declaration
        staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
        staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
        staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
        staging: comedi: dt2811: remove redundant initialization of 'ns'
        staging: wilc1000: fix alignments to match open parenthesis
        staging: wilc1000: removed unnecessary defined enums typedef
        staging: wilc1000: remove unnecessary use of parentheses
        staging: rtl8192u: remove redundant initialization of 'timeout'
        staging: sm750fb: fix CamelCase for dispSet var
        staging: lustre: lnet/selftest: fix compile error on UP build
        staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
        staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
        ...
      5d8515bc
    • Linus Torvalds's avatar
      Merge tag 'tty-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · db593322
      Linus Torvalds authored
      Pull tty/staging driver updates from Greg KH:
       "Here is the big tty/serial driver update for 4.16-rc1.
      
        The usual number of various serial driver fixes and updates to try to
        get them to work with crazy hardware configurations (seriously, how
        many different ways are hardware engineers going to come up with to
        hook up a simple UART?)
      
        There is also some serdev bugfixes and updates, as well as a
        smattering of other small fixes in here.
      
        All have been in the linux-next tree for a while, with no reported
        issues"
      
      * tag 'tty-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
        tty: serial: exar: Relocate sleep wake-up handling
        tty: fix data race between tty_init_dev and flush of buf
        serial: imx: fix endless loop during suspend
        serial: core: mark port as initialized after successful IRQ change
        serdev: only match serdev devices
        serdev: do not generate modaliases for controllers
        serial: mxs-auart: don't use GPIOF_* with gpiod_get_direction
        serial: 8250_dw: Revert "Improve clock rate setting"
        MAINTAINERS: Add myself as designated reviewer for 8250_dw
        gpio: serial: max310x: Support open-drain configuration for GPIOs
        serdev: Fix serdev_uevent failure on ACPI enumerated serdev-controllers
        serial: 8250_ingenic: Parse earlycon options
        serial: 8250_ingenic: Add support for the JZ4770 SoC
        serial: core: Make uart_parse_options take const char* argument
        serial: 8250_of: fix return code when probe function fails to get reset
        serial: imx: Only wakeup via RTSDEN bit if the system has RTS/CTS
        serial: 8250_uniphier: fix error return code in uniphier_uart_probe()
        tty: n_gsm: Allow ADM response in addition to UA for control dlci
        tty: omap-serial: Fix initial on-boot RTS GPIO level
        tty: serial: jsm: Add one check against NULL pointer dereference
        ...
      db593322
    • Linus Torvalds's avatar
      Merge tag 'usb-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · e4ee8b85
      Linus Torvalds authored
      Pull USB/PHY updates from Greg KH:
       "Here is the big USB and PHY driver update for 4.16-rc1.
      
        Along with the normally expected XHCI, MUSB, and Gadget driver
        patches, there are some PHY driver fixes, license cleanups, sysfs
        attribute cleanups, usbip changes, and a raft of other smaller fixes
        and additions.
      
        Full details are in the shortlog.
      
        All of these have been in the linux-next tree for a long time with no
        reported issues"
      
      * tag 'usb-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (137 commits)
        USB: serial: pl2303: new device id for Chilitag
        USB: misc: fix up some remaining DEVICE_ATTR() usages
        USB: musb: fix up one odd DEVICE_ATTR() usage
        USB: atm: fix up some remaining DEVICE_ATTR() usage
        USB: move many drivers to use DEVICE_ATTR_WO
        USB: move many drivers to use DEVICE_ATTR_RO
        USB: move many drivers to use DEVICE_ATTR_RW
        USB: misc: chaoskey: Use true and false for boolean values
        USB: storage: remove old wording about how to submit a change
        USB: storage: remove invalid URL from drivers
        usb: ehci-omap: don't complain on -EPROBE_DEFER when no PHY found
        usbip: list: don't list devices attached to vhci_hcd
        usbip: prevent bind loops on devices attached to vhci_hcd
        USB: serial: remove redundant initializations of 'mos_parport'
        usb/gadget: Fix "high bandwidth" check in usb_gadget_ep_match_desc()
        usb: gadget: compress return logic into one line
        usbip: vhci_hcd: update 'status' file header and format
        USB: serial: simple: add Motorola Tetra driver
        CDC-ACM: apply quirk for card reader
        usb: option: Add support for FS040U modem
        ...
      e4ee8b85
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide · 7109a04e
      Linus Torvalds authored
      Pull small IDE cleanup from David Miller.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
        ide: remove duplicated assignment to 'cursg'
      7109a04e
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next · ba49097e
      Linus Torvalds authored
      Pull sparc updates from David Miller:
       "Of note is the addition of a driver for the Data Analytics
        Accelerator, and some small cleanups"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
        oradax: Fix return value check in dax_attach()
        sparc: vDSO: remove an extra tab
        sparc64: drop unneeded compat include
        sparc64: Oracle DAX driver
        sparc64: Oracle DAX infrastructure
      ba49097e
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · ca0c836d
      Linus Torvalds authored
      Pull s390 updates from Martin Schwidefsky:
       "Bug fixes, small improvements and one notable change: the system call
        table and the unistd.h header are now generated automatically with a
        shell script from a text file"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/decompressor: discard __ksymtab and .eh_frame sections
        s390: fix handling of -1 in set{,fs}[gu]id16 syscalls
        s390/tools: generate header files in arch/s390/include/generated/
        s390/syscalls: use generated syscall_table.h and unistd.h header files
        s390/syscalls: add Makefile to generate system call header files
        s390/syscalls: add syscalltbl script
        s390/syscalls: add system call table
        s390/decompressor: swap .text and .rodata.compressed sections
        s390/sclp: fix .data section specification
        s390/ipl: avoid usage of __section(.data)
        s390/head: replace hard coded values with constants
        s390/disassembler: add generated gen_opcode_table tool to .gitignore
        s390: remove bogus system call table entries
        s390/kprobes: remove duplicate includes
        s390/dasd: Remove dead return code checks
        s390/dasd: Simplify code
        s390/vdso: revise CFI annotations of vDSO functions
        s390/kernel: emit CFI data in .debug_frame and discard .eh_frame sections
      ca0c836d
    • Linus Torvalds's avatar
      Merge tag 'docs-4.16' of git://git.lwn.net/linux · 255442c9
      Linus Torvalds authored
      Pull documentation updates from Jonathan Corbet:
       "Documentation updates for 4.16.
      
        New stuff includes refcount_t documentation, errseq documentation,
        kernel-doc support for nested structure definitions, the removal of
        lots of crufty kernel-doc support for unused formats, SPDX tag
        documentation, the beginnings of a manual for subsystem maintainers,
        and lots of fixes and updates.
      
        As usual, some of the changesets reach outside of Documentation/ to
        effect kerneldoc comment fixes. It also adds the new LICENSES
        directory, of which Thomas promises I do not need to be the
        maintainer"
      
      * tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
        linux-next: docs-rst: Fix typos in kfigure.py
        linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
        Documentation: Fix misconversion of #if
        docs: add index entry for networking/msg_zerocopy
        Documentation: security/credentials.rst: explain need to sort group_list
        LICENSES: Add MPL-1.1 license
        LICENSES: Add the GPL 1.0 license
        LICENSES: Add Linux syscall note exception
        LICENSES: Add the MIT license
        LICENSES: Add the BSD-3-clause "Clear" license
        LICENSES: Add the BSD 3-clause "New" or "Revised" License
        LICENSES: Add the BSD 2-clause "Simplified" license
        LICENSES: Add the LGPL-2.1 license
        LICENSES: Add the LGPL 2.0 license
        LICENSES: Add the GPL 2.0 license
        Documentation: Add license-rules.rst to describe how to properly identify file licenses
        scripts: kernel_doc: better handle show warnings logic
        fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
        doc: md: Fix a file name to md-fault.c in fault-injection.txt
        errseq: Add to documentation tree
        ...
      255442c9
    • Linus Torvalds's avatar
      Merge branch 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · d76e0a05
      Linus Torvalds authored
      Pull vmci iov_iter updates from Al Viro:
       "Get rid of "is it an iovec or an entire array?" flags in vmxi - just
        use iov_iter. Simplifies the living hell out of that code..."
      
      * 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        vmci: the same on the send side...
        vmci: simplify qp_dequeue_locked()
        vmci: get rid of qp_memcpy_from_queue()
        vmci: fix buf_size in case of iovec-based accesses
      d76e0a05
    • Linus Torvalds's avatar
      Merge branch 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 40b9672a
      Linus Torvalds authored
      Pull asm/uaccess.h whack-a-mole from Al Viro:
       "It's linux/uaccess.h, damnit... Oh, well - eventually they'll stop
        cropping up..."
      
      * 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        asm-prototypes.h: use linux/uaccess.h, not asm/uaccess.h
        riscv: use linux/uaccess.h, not asm/uaccess.h...
        ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
      40b9672a
    • Linus Torvalds's avatar
      Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · dc1efc3c
      Linus Torvalds authored
      Pull dcache updates from Al Viro:
       "Neil Brown's d_move()/d_path() race fix"
      
      * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        VFS: close race between getcwd() and d_move()
      dc1efc3c
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 73da9e1a
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - misc fixes
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
        mm: remove PG_highmem description
        tools, vm: new option to specify kpageflags file
        mm/swap.c: make functions and their kernel-doc agree
        mm, memory_hotplug: fix memmap initialization
        mm: correct comments regarding do_fault_around()
        mm: numa: do not trap faults on shared data section pages.
        hugetlb, mbind: fall back to default policy if vma is NULL
        hugetlb, mempolicy: fix the mbind hugetlb migration
        mm, hugetlb: further simplify hugetlb allocation API
        mm, hugetlb: get rid of surplus page accounting tricks
        mm, hugetlb: do not rely on overcommit limit during migration
        mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
        mm, hugetlb: unify core page allocation accounting and initialization
        mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
        mm/memcontrol.c: make local symbol static
        mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
        include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
        mm/compaction.c: fix comment for try_to_compact_pages()
        mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
        zsmalloc: use U suffix for negative literals being shifted
        ...
      73da9e1a
    • Miles Chen's avatar
      mm: remove PG_highmem description · 3f56a2f8
      Miles Chen authored
      Commit cbe37d09 ("[PATCH] mm: remove PG_highmem") removed PG_highmem
      to save a page flag.  So the description of PG_highmem is no longer
      needed.
      
      Link: http://lkml.kernel.org/r/1517391212-2950-1-git-send-email-miles.chen@mediatek.comSigned-off-by: default avatarMiles Chen <miles.chen@mediatek.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f56a2f8
    • David Rientjes's avatar
      tools, vm: new option to specify kpageflags file · c7905f20
      David Rientjes authored
      page-types currently hardcodes /proc/kpageflags as the file to parse.
      This works when using the tool to examine the state of pageflags on the
      same system, but does not allow storing a snapshot of pageflags at a
      given time to debug issues nor on a different system.
      
      This allows the user to specify a saved version of kpageflags with a new
      page-types -F option.
      
      [akpm@linux-foundation.org: add "filename" to fix usage() string]
      [rientjes@google.com: fix layout]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301840050.140969@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301458180.153857@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7905f20
    • Randy Dunlap's avatar
      mm/swap.c: make functions and their kernel-doc agree · e02a9f04
      Randy Dunlap authored
      Fix some basic kernel-doc notation in mm/swap.c:
      
       - for function lru_cache_add_anon(), make its kernel-doc function name
         match its function name and change colon to hyphen following the
         function name
      
       - for function pagevec_lookup_entries(), change the function parameter
         name from nr_pages to nr_entries since that is more descriptive of
         what the parameter actually is and then it matches the kernel-doc
         comments also
      
      Fix function kernel-doc to match the change in commit 67fd707f:
      
       - drop the kernel-doc notation for @nr_pages from
         pagevec_lookup_range() and correct the function description for that
         change
      
      Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
      Fixes: 67fd707f ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e02a9f04
    • Michal Hocko's avatar
      mm, memory_hotplug: fix memmap initialization · 9bb5a391
      Michal Hocko authored
      Bharata has noticed that onlining a newly added memory doesn't increase
      the total memory, pointing to commit f7f99100 ("mm: stop zeroing
      memory during allocation in vmemmap") as a culprit.  This commit has
      changed the way how the memory for memmaps is initialized and moves it
      from the allocation time to the initialization time.  This works
      properly for the early memmap init path.
      
      It doesn't work for the memory hotplug though because we need to mark
      page as reserved when the sparsemem section is created and later
      initialize it completely during onlining.  memmap_init_zone is called in
      the early stage of onlining.  With the current code it calls
      __init_single_page and as such it clears up the whole stage and
      therefore online_pages_range skips those pages.
      
      Fix this by skipping mm_zero_struct_page in __init_single_page for
      memory hotplug path.  This is quite uggly but unifying both early init
      and memory hotplug init paths is a large project.  Make sure we plug the
      regression at least.
      
      Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Tested-by: default avatarBharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9bb5a391
    • William Kucharski's avatar
      mm: correct comments regarding do_fault_around() · da391d64
      William Kucharski authored
      There are multiple comments surrounding do_fault_around that memtion
      fault_around_pages() and fault_around_mask(), two routines that do not
      exist.  These comments should be reworded to reference
      fault_around_bytes, the value which is used to determine how much
      do_fault_around() will attempt to read when processing a fault.
      
      These comments should have been updated when fault_around_pages() and
      fault_around_mask() were removed in commit aecd6f44 ("mm: close race
      between do_fault_around() and fault_around_bytes_set()").
      
      Fixes: aecd6f44 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
      Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.comSigned-off-by: default avatarWilliam Kucharski <william.kucharski@oracle.com>
      Reviewed-by: default avatarLarry Bassel <larry.bassel@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da391d64
    • Henry Willard's avatar
      mm: numa: do not trap faults on shared data section pages. · 859d4adc
      Henry Willard authored
      Workloads consisting of a large number of processes running the same
      program with a very large shared data segment may experience performance
      problems when numa balancing attempts to migrate the shared cow pages.
      This manifests itself with many processes or tasks in
      TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
      
      The program listed below simulates the conditions with these results
      when run with 288 processes on a 144 core/8 socket machine.
      
      Average throughput 	Average throughput     Average throughput
      with numa_balancing=0	with numa_balancing=1  with numa_balancing=1
           			without the patch      with the patch
      ---------------------	---------------------  ---------------------
      2118782			2021534		       2107979
      
      Complex production environments show less variability and fewer poorly
      performing outliers accompanied with a smaller number of processes
      waiting on NUMA page migration with this patch applied.  In some cases,
      %iowait drops from 16%-26% to 0.
      
        // SPDX-License-Identifier: GPL-2.0
        /*
         * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
         */
        #include <sys/time.h>
        #include <stdio.h>
        #include <wait.h>
        #include <sys/mman.h>
      
        int a[1000000] = {13};
      
        int  main(int argc, const char **argv)
        {
      	int n = 0;
      	int i;
      	pid_t pid;
      	int stat;
      	int *count_array;
      	int cpu_count = 288;
      	long total = 0;
      
      	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};
      
      	if (argc > 2)
      		cpu_count = atoi(argv[2]);
      
      	count_array = mmap(NULL, cpu_count * sizeof(int),
      			   (PROT_READ|PROT_WRITE),
      			   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);
      
      	if (count_array == MAP_FAILED) {
      		perror("mmap:");
      		return 0;
      	}
      
      	for (i = 0; i < cpu_count; ++i) {
      		pid = fork();
      		if (pid <= 0)
      			break;
      		if ((i & 0xf) == 0)
      			usleep(2);
      	}
      
      	if (pid != 0) {
      		if (i == 0) {
      			perror("fork:");
      			return 0;
      		}
      
      		for (;;) {
      			pid = wait(&stat);
      			if (pid < 0)
      				break;
      		}
      
      		for (i = 0; i < cpu_count; ++i)
      			total += count_array[i];
      
      		printf("Total %ld\n", total);
      		munmap(count_array, cpu_count * sizeof(int));
      		return 0;
      	}
      
      	gettimeofday(&t1, 0);
      	timeradd(&t1, &t2, &t1);
      	while (timercmp(&t2, &t1, <)) {
      		int b = 0;
      		int j;
      
      		for (j = 0; j < 1000000; j++)
      			b += a[j];
      		gettimeofday(&t2, 0);
      		n++;
      	}
      	count_array[i] = n;
      	return 0;
        }
      
      This patch changes change_pte_range() to skip shared copy-on-write pages
      when called from change_prot_numa().
      
      NOTE: change_prot_numa() is nominally called from task_numa_work() and
      queue_pages_test_walk().  task_numa_work() is the auto NUMA balancing
      path, and queue_pages_test_walk() is part of explicit NUMA policy
      management.  However, queue_pages_test_walk() only calls
      change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
      not allowed, so change_prot_numa() is only called from auto NUMA
      balancing.
      
      In the case of explicit NUMA policy management, shared pages are not
      migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
      depends on CAP_SYS_NICE.  Currently, there is no way to pass information
      about MPOL_MF_MOVE_ALL to change_pte_range.  This will have to be fixed
      if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
      migration mode.
      
      task_numa_work() skips the read-only VMAs of programs and shared
      libraries.
      
      Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.comSigned-off-by: default avatarHenry Willard <henry.willard@oracle.com>
      Reviewed-by: default avatarHåkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: default avatarSteve Sistare <steven.sistare@oracle.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      859d4adc
    • Michal Hocko's avatar
      hugetlb, mbind: fall back to default policy if vma is NULL · 389c8178
      Michal Hocko authored
      Dan Carpenter has noticed that mbind migration callback (new_page) can
      get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
      relies on the VMA to get the hstate.  We used to BUG_ON this case but
      the BUG_+ON has been removed recently by "hugetlb, mempolicy: fix the
      mbind hugetlb migration".
      
      The proper way to handle this is to get the hstate from the migrated
      page and rely on huge_node (resp.  get_vma_policy) do the right thing
      with null VMA.  We are currently falling back to the default mempolicy
      in that case which is in line what THP path is doing here.
      
      Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.czSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      389c8178
    • Michal Hocko's avatar
      hugetlb, mempolicy: fix the mbind hugetlb migration · ebd63723
      Michal Hocko authored
      do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
      pages.  alloc_huge_page_noerr uses alloc_huge_page which is a highlevel
      allocation function which has to take care of reserves, overcommit or
      hugetlb cgroup accounting.  None of that is really required for the page
      migration because the new page is only temporal and either will replace
      the original page or it will be dropped.  This is essentially as for
      other migration call paths and there shouldn't be any reason to handle
      mbind in a special way.
      
      The current implementation is even suboptimal because the migration
      might fail just because the hugetlb cgroup limit is reached, or the
      overcommit is saturated.
      
      Fix this by making mbind like other hugetlb migration paths.  Add a new
      migration helper alloc_huge_page_vma as a wrapper around
      alloc_huge_page_nodemask with additional mempolicy handling.
      
      alloc_huge_page_noerr has no more users and it can go.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebd63723
    • Michal Hocko's avatar
      mm, hugetlb: further simplify hugetlb allocation API · 0c397dae
      Michal Hocko authored
      Hugetlb allocator has several layer of allocation functions depending
      and the purpose of the allocation.  There are two allocators depending
      on whether the page can be allocated from the page allocator or we need
      a contiguous allocator.  This is currently opencoded in
      alloc_fresh_huge_page which is the only path that might allocate giga
      pages which require the later allocator.  Create alloc_fresh_huge_page
      which hides this implementation detail and use it in all callers which
      hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page).
      This shouldn't introduce any funtional change because both migration and
      surplus allocators exlude giga pages explicitly.
      
      While we are at it let's do some renaming.  The current scheme is not
      consistent and overly painfull to read and understand.  Get rid of
      prefix underscores from most functions.  There is no real reason to make
      names longer.
      
      * alloc_fresh_huge_page is the new layer to abstract underlying
        allocator
      * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
        alloc_buddy_huge_page.
      * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
        the new page directly to the pool
      * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
        as it uses alloc_fresh_huge_page now
      * others lose their excessive prefix underscores to make names shorter
      
      [dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
        Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
      Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c397dae
    • Michal Hocko's avatar
      mm, hugetlb: get rid of surplus page accounting tricks · 9980d744
      Michal Hocko authored
      alloc_surplus_huge_page increases the pool size and the number of
      surplus pages opportunistically to prevent from races with the pool size
      change.  See commit d1c3fb1f ("hugetlb: introduce
      nr_overcommit_hugepages sysctl") for more details.
      
      The resulting code is unnecessarily hairy, cause code duplication and
      doesn't allow to share the allocation paths.  Moreover pool size changes
      tend to be very seldom so optimizing for them is not really reasonable.
      Simplify the code and allow to allocate a fresh surplus page as long as
      we are under the overcommit limit and then recheck the condition after
      the allocation and drop the new page if the situation has changed.  This
      should provide a reasonable guarantee that an abrupt allocation requests
      will not go way off the limit.
      
      If we consider races with the pool shrinking and enlarging then we
      should be reasonably safe as well.  In the first case we are off by one
      in the worst case and the second case should work OK because the page is
      not yet visible.  We can waste CPU cycles for the allocation but that
      should be acceptable for a relatively rare condition.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9980d744
    • Michal Hocko's avatar
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko authored
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow to migrate any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two numa nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way how we allocate and account
      pages allocated for migration.  Those should temporal by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporal status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node suprlus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
    • Michal Hocko's avatar
      mm, hugetlb: integrate giga hugetlb more naturally to the allocation path · d9cc948f
      Michal Hocko authored
      Gigantic hugetlb pages were ingrown to the hugetlb code as an alien
      specie with a lot of special casing.  The allocation path is not an
      exception.  Unnecessarily so to be honest.  It is true that the
      underlying allocator is different but that is an implementation detail.
      
      This patch unifies the hugetlb allocation path that a prepares fresh
      pool pages.  alloc_fresh_gigantic_page basically copies
      alloc_fresh_huge_page logic so we can move everything there.  This will
      simplify set_max_huge_pages which doesn't have to care about what kind
      of huge page we allocate.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9cc948f
    • Michal Hocko's avatar
      mm, hugetlb: unify core page allocation accounting and initialization · af0fb9df
      Michal Hocko authored
      Patch series "mm, hugetlb: allocation API and migration improvements"
      
      Motivation:
      
      this is a follow up for [3] for the allocation API and [4] for the
      hugetlb migration.  It wasn't really easy to split those into two
      separate patch series as they share some code.
      
      My primary motivation to touch this code is to make the gigantic pages
      migration working.  The giga pages allocation code is just too fragile
      and hacked into the hugetlb code now.  This series tries to move giga
      pages closer to the first class citizen.  We are not there yet but
      having 5 patches is quite a lot already and it will already make the
      code much easier to follow.  I will come with other changes on top after
      this sees some review.
      
      The first two patches should be trivial to review.  The third patch
      changes the way how we migrate huge pages.  Newly allocated pages are a
      subject of the overcommit check and they participate surplus accounting
      which is quite unfortunate as the changelog explains.  This patch
      doesn't change anything wrt.  giga pages.
      
      Patch #4 removes the surplus accounting hack from
      __alloc_surplus_huge_page.  I hope I didn't miss anything there and a
      deeper review is really due there.
      
      Patch #5 finally unifies allocation paths and giga pages shouldn't be
      any special anymore.  There is also some renaming going on as well.
      
      This patch (of 6):
      
      hugetlb allocator has two entry points to the page allocator
       - alloc_fresh_huge_page_node
       - __hugetlb_alloc_buddy_huge_page
      
      The two differ very subtly in two aspects.  The first one doesn't care
      about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
      prep_new_huge_page is not used because it not only initializes hugetlb
      specific stuff but because it also put_page and releases the page to the
      hugetlb pool which is not what is required in some contexts.  This makes
      things more complicated than necessary.
      
      Simplify things by a) removing the page allocator entry point duplicity
      and only keep __hugetlb_alloc_buddy_huge_page and b) make
      prep_new_huge_page more reusable by removing the put_page which moves
      the page to the allocator pool.  All current callers are updated to call
      put_page explicitly.  Later patches will add new callers which won't
      need it.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af0fb9df
    • Andrey Ryabinin's avatar
      mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes · 1ab5c056
      Andrey Ryabinin authored
      mem_cgroup_resize_[memsw]_limit() tries to free only 32
      (SWAP_CLUSTER_MAX) pages on each iteration.  This makes it practically
      impossible to decrease limit of memory cgroup.  Tasks could easily
      allocate back 32 pages, so we can't reduce memory usage, and once
      retry_count reaches zero we return -EBUSY.
      
      Easy to reproduce the problem by running the following commands:
      
        mkdir /sys/fs/cgroup/memory/test
        echo $$ >> /sys/fs/cgroup/memory/test/tasks
        cat big_file > /dev/null &
        sleep 1 && echo $((100*1024*1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        -bash: echo: write error: Device or resource busy
      
      Instead of relying on retry_count, keep retrying the reclaim until the
      desired limit is reached or fail if the reclaim doesn't make any
      progress or a signal is pending.
      
      Link: http://lkml.kernel.org/r/20180119132544.19569-1-aryabinin@virtuozzo.comSigned-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ab5c056
    • Christopher Díaz Riveros's avatar
      mm/memcontrol.c: make local symbol static · 8ad6e404
      Christopher Díaz Riveros authored
      Fix the following sparse warning:
      
        mm/memcontrol.c:1097:14: warning: symbol 'memcg1_stats' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180118193327.14200-1-chrisadr@gentoo.orgSigned-off-by: default avatarChristopher Díaz Riveros <chrisadr@gentoo.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ad6e404
    • Ralph Campbell's avatar
      mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd() · 8d63e4cd
      Ralph Campbell authored
      The variable 'entry' is used before being initialized in
      hmm_vma_walk_pmd().
      
      No bad effect (beside performance hit) so !non_swap_entry(0) evaluate to
      true which trigger a fault as if CPU was trying to access migrated
      memory and migrate memory back from device memory to regular memory.
      
      This function (hmm_vma_walk_pmd()) is called when a device driver tries
      to populate its own page table.  For migrated memory it should not
      happen as the device driver should already have populated its page table
      correctly during the migration.
      
      Only case I can think of is multi-GPU where a second GPU triggers
      migration back to regular memory.  Again this would just result in a
      performance hit, nothing bad would happen.
      
      Link: http://lkml.kernel.org/r/20180122185759.26286-1-jglisse@redhat.comSigned-off-by: default avatarRalph Campbell <rcampbell@nvidia.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d63e4cd
    • Petr Tesarik's avatar
      include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer · def9b71e
      Petr Tesarik authored
      The comment is confusing.  On the one hand, it refers to 32-bit
      alignment (struct page alignment on 32-bit platforms), but this would
      only guarantee that the 2 lowest bits must be zero.  On the other hand,
      it claims that at least 3 bits are available, and 3 bits are actually
      used.
      
      This is not broken, because there is a stronger alignment guarantee,
      just less obvious.  Let's fix the comment to make it clear how many bits
      are available and why.
      
      Although memmap arrays are allocated in various places, the resulting
      pointer is encoded eventually, so I am adding a BUG_ON() here to enforce
      at runtime that all expected bits are indeed available.
      
      I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
      sufficient, because this part of the calculation can be easily checked
      at build time.
      
      [ptesarik@suse.com: v2]
        Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz
      Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.czSigned-off-by: default avatarPetr Tesarik <ptesarik@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      def9b71e
    • Yang Shi's avatar
      mm/compaction.c: fix comment for try_to_compact_pages() · 112d2d29
      Yang Shi authored
      "mode" argument is not used by try_to_compact_pages() and sub functions
      anymore, it has been replaced by "prio".  Fix the comment to explain the
      use of "prio" argument.
      
      Link: http://lkml.kernel.org/r/1515801336-20611-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      112d2d29
    • Oscar Salvador's avatar
      mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it · 3a45acc0
      Oscar Salvador authored
      static struct page_ext_operations *page_ext_ops[] always contains debug_guardpage_ops,
      
      static struct page_ext_operations *page_ext_ops[] = {
              &debug_guardpage_ops,
       #ifdef CONFIG_PAGE_OWNER
              &page_owner_ops,
       #endif
      ...
      }
      
      but for it to work, CONFIG_DEBUG_PAGEALLOC must be enabled first.  If
      someone has CONFIG_PAGE_EXTENSION, but has none of its users, eg:
      (CONFIG_PAGE_OWNER, CONFIG_DEBUG_PAGEALLOC, CONFIG_IDLE_PAGE_TRACKING),
      we can shrink page_ext_init() to a simple retq.
      
        $ size vmlinux  (before patch)
              text      data       bss       dec       hex  filename
          14356698   5681582   1687748  21726028   14b834c  vmlinux
      
        $ size vmlinux  (after patch)
              text      data       bss       dec       hex  filename
          14356008   5681538   1687748  21725294   14b806e  vmlinux
      
      On the other hand, it might does not even make sense, since if someone
      enables CONFIG_PAGE_EXTENSION, I would expect him to enable also at
      least one of its users.
      
      Link: http://lkml.kernel.org/r/20180105130235.GA21241@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@techadventures.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a45acc0
    • Nick Desaulniers's avatar
      zsmalloc: use U suffix for negative literals being shifted · 01a6ad9a
      Nick Desaulniers authored
      Fix warning about shifting unsigned literals being undefined behavior.
      
      Link: http://lkml.kernel.org/r/1515642078-4259-1-git-send-email-nick.desaulniers@gmail.comSigned-off-by: default avatarNick Desaulniers <nick.desaulniers@gmail.com>
      Suggested-by: default avatarMinchan Kim <minchan@kernel.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Nick Desaulniers <nick.desaulniers@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01a6ad9a
    • Oscar Salvador's avatar
      mm/page_owner.c: clean up init_pages_in_zone() · 6787c1da
      Oscar Salvador authored
      Remove two redundant assignments in init_pages_in_zone().
      
      [osalvador@techadventures.net: v3]
        Link: http://lkml.kernel.org/r/20180117124513.GA876@techadventures.net
      [akpm@linux-foundation.org: coding style tweaks]
      Link: http://lkml.kernel.org/r/20180110084355.GA22822@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@techadventures.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6787c1da
    • Shile Zhang's avatar
    • Yu Zhao's avatar
      memcg: refactor mem_cgroup_resize_limit() · c054a78c
      Yu Zhao authored
      mem_cgroup_resize_limit() and mem_cgroup_resize_memsw_limit() have
      identical logics.  Refactor code so we don't need to keep two pieces of
      code that does same thing.
      
      Link: http://lkml.kernel.org/r/20180108224238.14583-1-yuzhao@google.comSigned-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c054a78c
    • Yu Zhao's avatar
      zswap: only save zswap header when necessary · 9c3760eb
      Yu Zhao authored
      We waste sizeof(swp_entry_t) for zswap header when using zsmalloc as
      zpool driver because zsmalloc doesn't support eviction.
      
      Add zpool_evictable() to detect if zpool is potentially evictable, and
      use it in zswap to avoid waste memory for zswap header.
      
      [yuzhao@google.com: The zpool->" prefix is a result of copy & paste]
        Link: http://lkml.kernel.org/r/20180110225626.110330-1-yuzhao@google.com
      Link: http://lkml.kernel.org/r/20180110224741.83751-1-yuzhao@google.comSigned-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarDan Streetman <ddstreet@ieee.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c3760eb
    • shidao.ytt's avatar
      mm/fadvise: discard partial page if endbyte is also EOF · a7ab400d
      shidao.ytt authored
      During our recent testing with fadvise(FADV_DONTNEED), we find that if
      given offset/length is not page-aligned, the last page will not be
      discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
      we map a 10KB-sized file into memory and then try to run this tool to
      evict the whole file mapping, but the last single page always remains
      staying in the memory:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 2.1e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 1/3  4K/12K  33.3%
               Elapsed: 5.5e-05 seconds
      
      However when we test with an older kernel, say 3.10, this problem is
      gone.  So we wonder if this is a regression:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 8.2e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
               Elapsed: 5e-05 seconds
      
      After digging a little bit into this problem, we find it seems not a
      regression.  Not discarding partial page is likely to be on purpose
      according to commit 441c228f ("mm: fadvise: document the
      fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
      Gorman.  He explained why partial pages should be preserved instead of
      being discarded when using fadvise(FADV_DONTNEED).
      
      However, the interesting part is that the actual code did NOT work as
      the same as it was described, the partial page was still discarded
      anyway, due to a calculation mistake of `end_index' passed to
      invalidate_mapping_pages().  This mistake has not been fixed until
      recently, that's why we fail to reproduce our problem in old kernels.
      The fix is done in commit 18aba41c ("mm/fadvise.c: do not discard
      partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.
      
      Back to the original testing, our problem becomes that there is a
      special case that, if the page-unaligned `endbyte' is also the end of
      file, it is not necessary at all to preserve the last partial page, as
      we all know no one else will use the rest of it.  It should be safe
      enough if we just discard the whole page.  So we add an EOF check in
      this patch.
      
      We also find a poosbile real world issue in mainline kernel.  Assume
      such scenario: A userspace backup application want to backup a huge
      amount of small files (<4k) at once, the developer might (I guess) want
      to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
      won't really happen since the only page mapped is a partial page, and
      kernel will preserve it.  Our patch also fixes this problem, since we
      know the endbyte is EOF, so we discard it.
      
      Here is a simple reproducer to reproduce and verify each scenario we
      described above:
      
        test_fadvise.c
        ==============================
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(int argc, char **argv)
        {
        	int i, fd, ret, len;
        	struct stat buf;
        	void *addr;
        	unsigned char *vec;
        	char *strbuf;
        	ssize_t pagesize = getpagesize();
        	ssize_t filesize;
      
        	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
        	if (fd < 0)
        		return -1;
        	filesize = strtoul(argv[2], NULL, 10);
      
        	strbuf = malloc(filesize);
        	memset(strbuf, 42, filesize);
        	write(fd, strbuf, filesize);
        	free(strbuf);
        	fsync(fd);
      
        	len = (filesize + pagesize - 1) / pagesize;
        	printf("length of pages: %d\n", len);
      
        	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
        	if (addr == MAP_FAILED)
        		return -1;
      
        	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
        	if (ret < 0)
        		return -1;
      
        	vec = malloc(len);
        	ret = mincore(addr, filesize, (void *)vec);
        	if (ret < 0)
        		return -1;
      
        	for (i = 0; i < len; i++)
        		printf("pages[%d]: %x\n", i, vec[i] & 0x1);
      
        	free(vec);
        	close(fd);
      
        	return 0;
        }
        ==============================
      
      Test 1: running on kernel with commit 18aba41c reverted:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.revert+
        [root@caspar ~]# ./test_fadvise file1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page discarded
        [root@caspar ~]# ./test_fadvise file2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise file3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page discarded
      
      Test 2: running on mainline kernel:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 1    # <-- partial and the only page not discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 1    # <-- partial page not discarded
      
      Test 3: running on kernel with this patch:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.patched+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page and EOF, discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page and EOF, discarded
      
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.comSigned-off-by: default avatarshidao.ytt <shidao.ytt@alibaba-inc.com>
      Signed-off-by: default avatarCaspar Zhang <jinli.zjl@alibaba-inc.com>
      Reviewed-by: default avatarOliver Yang <zhiche.yy@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7ab400d
    • Mel Gorman's avatar
      mm: pin address_space before dereferencing it while isolating an LRU page · 69d763fc
      Mel Gorman authored
      Minchan Kim asked the following question -- what locks protects
      address_space destroying when race happens between inode trauncation and
      __isolate_lru_page? Jan Kara clarified by describing the race as follows
      
      CPU1                                            CPU2
      
      truncate(inode)                                 __isolate_lru_page()
        ...
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                   mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                        if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.
      
      The race is theoretically possible but unlikely.  Before the
      delete_from_page_cache, truncate_cleanup_page is called so the page is
      likely to be !PageDirty or PageWriteback which gets skipped by the only
      caller that checks the mappping in __isolate_lru_page.  Even if the race
      occurs, a substantial amount of work has to happen during a tiny window
      with no preemption but it could potentially be done using a virtual
      machine to artifically slow one CPU or halt it during the critical
      window.
      
      This patch should eliminate the race with truncation by try-locking the
      page before derefencing mapping and aborting if the lock was not
      acquired.  There was a suggestion from Huang Ying to use RCU as a
      side-effect to prevent mapping being freed.  However, I do not like the
      solution as it's an unconventional means of preserving a mapping and
      it's not a context where rcu_read_lock is obviously protecting rcu data.
      
      Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
      Fixes: c8244935 ("mm: compaction: make isolate_lru_page() filter-aware again")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69d763fc
    • Eric Biggers's avatar
      userfaultfd: convert to use anon_inode_getfd() · 284cd241
      Eric Biggers authored
      Nothing actually calls userfaultfd_file_create() besides the
      userfaultfd() system call itself.  So simplify things by folding it into
      the system call and using anon_inode_getfd() instead of
      anon_inode_getfile().  Do the same in resolve_userfault_fork() as well.
      
      This removes over 50 lines with no change in functionality.
      
      Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.comSigned-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      284cd241
    • Marc-André Lureau's avatar
      memfd-test: run fuse test on hugetlb backend memory · c5c63835
      Marc-André Lureau authored
      Link: http://lkml.kernel.org/r/20171107122800.25517-10-marcandre.lureau@redhat.comSigned-off-by: default avatarMarc-André Lureau <marcandre.lureau@redhat.com>
      Suggested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5c63835