1. 04 Sep, 2015 40 commits
    • Sebastian Andrzej Siewior's avatar
      mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() · ce9ce665
      Sebastian Andrzej Siewior authored
      Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by
      Johannes in commit f371763a ("mm: memcontrol: fix false-positive
      VM_BUG_ON() on -rt").  The comment before that patch was a tiny bit better
      than it is now.  While the patch claimed to fix a false-postive on -RT
      this was not the case.  None of the -RT folks ACKed it and it was not a
      false positive report.  That was a *real* problem.
      
      This patch updates the comment that is improper because it refers to
      "disabled preemption" as a consequence of that lock being taken.  A
      spin_lock() disables preemption, true, but in this case the code relies on
      the fact that the lock _also_ disables interrupts once it is acquired.
      And this is the important detail (which was checked the VM_BUG_ON()) which
      needs to be pointed out.  This is the hint one needs while looking at the
      code.  It was explained by Johannes on the list that the per-CPU variables
      are protected by local_irq_save().  The BUG_ON() was helpful.  This code
      has been workarounded in -RT in the meantime.  I wouldn't mind running
      into more of those if the code in question uses *special* kind of locking
      since now there is no verification (in terms of lockdep or BUG_ON()) and
      therefore I bring the VM_BUG_ON() check back in.
      
      The two functions after the comment could also have a "local_irq_save()"
      dance around them in order to serialize access to the per-CPU variables.
      This has been avoided because the interrupts should be off.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce9ce665
    • Vladimir Zapolskiy's avatar
      genalloc: add support of multiple gen_pools per device · c98c3635
      Vladimir Zapolskiy authored
      This change fills devm_gen_pool_create()/gen_pool_get() "name" argument
      stub with contents and extends of_gen_pool_get() functionality on this
      basis.
      
      If there is no associated platform device with a device node passed to
      of_gen_pool_get(), the function attempts to get a label property or device
      node name (= repeats MTD OF partition standard) and seeks for a named
      gen_pool registered by device of the parent device node.
      
      The main idea of the change is to allow registration of independent
      gen_pools under the same umbrella device, say "partitions" on "storage
      device", the original functionality of one "partition" per "storage
      device" is untouched.
      
      [akpm@linux-foundation.org: fix constness in devres_find()]
      [dan.carpenter@oracle.com: freeing const data pointers]
      Signed-off-by: default avatarVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
      Cc: Philipp Zabel <p.zabel@pengutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Sascha Hauer <kernel@pengutronix.de>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c98c3635
    • Vladimir Zapolskiy's avatar
      genalloc: add name arg to gen_pool_get() and devm_gen_pool_create() · 73858173
      Vladimir Zapolskiy authored
      This change modifies gen_pool_get() and devm_gen_pool_create() client
      interfaces adding one more argument "name" of a gen_pool object.
      
      Due to implementation gen_pool_get() is capable to retrieve only one
      gen_pool associated with a device even if multiple gen_pools are created,
      fortunately right at the moment it is sufficient for the clients, hence
      provide NULL as a valid argument on both producer devm_gen_pool_create()
      and consumer gen_pool_get() sides.
      
      Because only one created gen_pool per device is addressable, explicitly
      add a restriction to devm_gen_pool_create() to create only one gen_pool
      per device, this implies two possible error codes returned by the
      function, account it on client side (only misc/sram).  This completes
      client side changes related to genalloc updates.
      
      [akpm@linux-foundation.org: gen_pool_get() cleanup]
      Signed-off-by: default avatarVladimir Zapolskiy <vladimir_zapolskiy@mentor.com>
      Cc: Philipp Zabel <p.zabel@pengutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Sascha Hauer <kernel@pengutronix.de>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73858173
    • Wei Yang's avatar
      mm/memblock: WARN_ON when nid differs from overlap region · c0a29498
      Wei Yang authored
      Each memblock_region has nid to indicates the Node ID of this range.  For
      the overlap case, memblock_add_range() inserts the lower part and leave
      the upper part as indicated in the overlapped region.
      
      If the nid of the new range differs from the overlapped region, the
      information recorded is not correct.
      
      This patch adds a WARN_ON when the nid of the new range differs from the
      overlapped region.
      Signed-off-by: default avatarWei Yang <weiyang@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0a29498
    • Mel Gorman's avatar
      Documentation/features/vm: add feature description and arch support status for... · c7e1e3cc
      Mel Gorman authored
      Documentation/features/vm: add feature description and arch support status for batched TLB flush after unmap
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7e1e3cc
    • Mel Gorman's avatar
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman authored
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • Mel Gorman's avatar
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman authored
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
    • Mel Gorman's avatar
      x86, mm: trace when an IPI is about to be sent · 5b74283a
      Mel Gorman authored
      When unmapping pages it is necessary to flush the TLB.  If that page was
      accessed by another CPU then an IPI is used to flush the remote CPU.  That
      is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
      second.
      
      There already is a window between when a page is unmapped and when it is
      TLB flushed.  This series increases the window so multiple pages can be
      flushed using a single IPI.  This should be safe or the kernel is hosed
      already.
      
      Patch 1 simply made the rest of the series easier to write as ftrace
              could identify all the senders of TLB flush IPIS.
      
      Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
              to flush the entire TLB.
      
      Patch 3 tracks when there potentially are writable TLB entries that
              need to be batched differently
      
      Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes
      
      The performance impact is documented in the changelogs but in the optimistic
      case on a 4-socket machine the full series reduces interrupts from 900K
      interrupts/second to 60K interrupts/second.
      
      This patch (of 4):
      
      It is easy to trace when an IPI is received to flush a TLB but harder to
      detect what event sent it.  This patch makes it easy to identify the
      source of IPIs being transmitted for TLB flushes on x86.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarDave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b74283a
    • Andrea Arcangeli's avatar
      userfaultfd: selftest · c47174fc
      Andrea Arcangeli authored
      This test allocates two virtual areas and bounces the physical memory
      across the two virtual areas using only userfaultfd.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c47174fc
    • Andrea Arcangeli's avatar
      userfaultfd: avoid missing wakeups during refile in userfaultfd_read · 2c5b7e1b
      Andrea Arcangeli authored
      During the refile in userfaultfd_read both waitqueues could look empty to
      the lockless wake_userfault().  Use a seqcount to prevent this false
      negative that could leave an userfault blocked.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c5b7e1b
    • Andrea Arcangeli's avatar
      userfaultfd: propagate the full address in THP faults · 230c92a8
      Andrea Arcangeli authored
      The THP faults were not propagating the original fault address.  The
      latest version of the API with uffd.arg.pagefault.address is supposed to
      propagate the full address through THP faults.
      
      This was not a kernel crashing bug and it wouldn't risk to corrupt user
      memory, but it would cause a SIGBUS failure because the wrong page was
      being copied.
      
      For various reasons this wasn't easily reproducible in the qemu workload,
      but the strestest exposed the problem immediately.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      230c92a8
    • Andrea Arcangeli's avatar
      userfaultfd: allow signals to interrupt a userfault · dfa37dc3
      Andrea Arcangeli authored
      This is only simple to achieve if the userfault is going to return to
      userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
      despite we temporarily released the mmap_sem.  The fault would just be
      retried by userland then.  This is safe at least on x86 and powerpc (the
      two archs with the syscall implemented so far).
      
      Hint to verify for which archs this is safe: after handle_mm_fault
      returns, no access to data structures protected by the mmap_sem must be
      done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
      is called.
      
      This has two main benefits: signals can run with lower latency in
      production (signals aren't blocked by userfaults and userfaults are
      immediately repeated after signal processing) and gdb can then trivially
      debug the threads blocked in this kind of userfaults coming directly from
      userland.
      
      On a side note: while gdb has a need to get signal processed, coredumps
      always worked perfectly with userfaults, no matter if the userfault is
      triggered by GUP a kernel copy_user or directly from userland.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dfa37dc3
    • Andrea Arcangeli's avatar
      userfaultfd: require UFFDIO_API before other ioctls · e6485a47
      Andrea Arcangeli authored
      UFFDIO_API was already forced before read/poll could work.  This makes the
      code more strict to force it also for all other ioctls.
      
      All users would already have been required to call UFFDIO_API before
      invoking other ioctls but this makes it more explicit.
      
      This will ensure we can change all ioctls (all but UFFDIO_API/struct
      uffdio_api) with a bump of uffdio_api.api.
      
      There's no actual plan or need to change the API or the ioctl, the current
      API already should cover fine even the non cooperative usage, but this is
      just for the longer term future just in case.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6485a47
    • Andrea Arcangeli's avatar
      userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE · ad465cae
      Andrea Arcangeli authored
      These two ioctl allows to either atomically copy or to map zeropages
      into the virtual address space. This is used by the thread that opened
      the userfaultfd to resolve the userfaults.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad465cae
    • Andrea Arcangeli's avatar
      userfaultfd: avoid mmap_sem read recursion in mcopy_atomic · b6ebaedb
      Andrea Arcangeli authored
      If the rwsem starves writers it wasn't strictly a bug but lockdep
      doesn't like it and this avoids depending on lowlevel implementation
      details of the lock.
      
      [akpm@linux-foundation.org: delete weird BUILD_BUG_ON()]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6ebaedb
    • Andrea Arcangeli's avatar
      userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation · c1a4de99
      Andrea Arcangeli authored
      This implements mcopy_atomic and mfill_zeropage that are the lowlevel
      VM methods that are invoked respectively by the UFFDIO_COPY and
      UFFDIO_ZEROPAGE userfaultfd commands.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1a4de99
    • Andrea Arcangeli's avatar
      userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI · 1f1c6f07
      Andrea Arcangeli authored
      This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f1c6f07
    • Andrea Arcangeli's avatar
      userfaultfd: activate syscall · 1380fca0
      Andrea Arcangeli authored
      This activates the userfaultfd syscall.
      
      [sfr@canb.auug.org.au: activate syscall fix]
      [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1380fca0
    • Andrea Arcangeli's avatar
      userfaultfd: buildsystem activation · a14c151e
      Andrea Arcangeli authored
      This allows to select the userfaultfd during configuration to build it.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a14c151e
    • Andrea Arcangeli's avatar
      userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 8d2afd96
      Andrea Arcangeli authored
      Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
      userfaultfd_read if they are run on different threads simultaneously.
      
      Until now qemu solved the race in userland: the race was explicitly
      and intentionally left for userland to solve. However we can also
      solve it in kernel.
      
      Requiring all users to solve this race if they use two threads (one
      for the background transfer and one for the userfault reads) isn't
      very attractive from an API prospective, furthermore this allows to
      remove a whole bunch of mutex and bitmap code from qemu, making it
      faster. The cost of __get_user_pages_fast should be insignificant
      considering it scales perfectly and the pagetables are already hot in
      the CPU cache, compared to the overhead in userland to maintain those
      structures.
      
      Applying this patch is backwards compatible with respect to the
      userfaultfd userland API, however reverting this change wouldn't be
      backwards compatible anymore.
      
      Without this patch qemu in the background transfer thread, has to read
      the old state, and do UFFDIO_WAKE if old_state is missing but it
      become REQUESTED by the time it tries to set it to RECEIVED (signaling
      the other side received an userfault).
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
                                              postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
      
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                              /* check that no userfault raced with UFFDIO_COPY */
                              if (old_state == MISSING && tmp_state == REQUESTED)
                                      UFFDIO_WAKE from background thread
      
      And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
      
                                              if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                      UFFDIO_WAKE from userfault thread
      
      This patch removes the need of both UFFDIO_WAKE and of the associated
      per-page tristate as well.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d2afd96
    • Andrea Arcangeli's avatar
      userfaultfd: allocate the userfaultfd_ctx cacheline aligned · 3004ec9c
      Andrea Arcangeli authored
      Use proper slab to guarantee alignment.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3004ec9c
    • Andrea Arcangeli's avatar
      userfaultfd: optimize read() and poll() to be O(1) · 15b726ef
      Andrea Arcangeli authored
      This makes read O(1) and poll that was already O(1) becomes lockless.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15b726ef
    • Andrea Arcangeli's avatar
      userfaultfd: wake pending userfaults · ba85c702
      Andrea Arcangeli authored
      This is an optimization but it's a userland visible one and it affects
      the API.
      
      The downside of this optimization is that if you call poll() and you
      get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
      may be waken by a different thread, before read(ufd) comes
      around. This in short means that poll() isn't really usable if the
      userfaultfd is opened in blocking mode.
      
      userfaults won't wait in "pending" state to be read anymore and any
      UFFDIO_WAKE or similar operations that has the objective of waking
      userfaults after their resolution, will wake all blocked userfaults
      for the resolved range, including those that haven't been read() by
      userland yet.
      
      The behavior of poll() becomes not standard, but this obviates the
      need of "spurious" UFFDIO_WAKE and it lets the userland threads to
      restart immediately without requiring an UFFDIO_WAKE. This is even
      more significant in case of repeated faults on the same address from
      multiple threads.
      
      This optimization is justified by the measurement that the number of
      spurious UFFDIO_WAKE accounts for 5% and 10% of the total
      userfaults for heavy workloads, so it's worth optimizing those away.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba85c702
    • Andrea Arcangeli's avatar
      userfaultfd: change the read API to return a uffd_msg · a9b85f94
      Andrea Arcangeli authored
      I had requests to return the full address (not the page aligned one) to
      userland.
      
      It's not entirely clear how the page offset could be relevant because
      userfaults aren't like SIGBUS that can sigjump to a different place and it
      actually skip resolving the fault depending on a page offset.  There's
      currently no real way to skip the fault especially because after a
      UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
      kernel without having to return to userland first (not even self modifying
      code replacing the .text that touched the faulting address would prevent
      the fault to be repeated).  Userland cannot skip repeating the fault even
      more so if the fault was triggered by a KVM secondary page fault or any
      get_user_pages or any copy-user inside some syscall which will return to
      kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
      to a SIGBUS being raised because the userfault can't wait if it cannot
      release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
      that).
      
      Still returning userland a proper structure during the read() on the uffd,
      can allow to use the current UFFD_API for the future non-cooperative
      extensions too and it looks cleaner as well.  Once we get additional
      fields there's no point to return the fault address page aligned anymore
      to reuse the bits below PAGE_SHIFT.
      
      The only downside is that the read() syscall will read 32bytes instead of
      8bytes but that's not going to be measurable overhead.
      
      The total number of new events that can be extended or of new future bits
      for already shipped events, is limited to 64 by the features field of the
      uffdio_api structure.  If more will be needed a bump of UFFD_API will be
      required.
      
      [akpm@linux-foundation.org: use __packed]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9b85f94
    • Pavel Emelyanov's avatar
      userfaultfd: Rename uffd_api.bits into .features · 3f602d27
      Pavel Emelyanov authored
      This is (seems to be) the minimal thing that is required to unblock
      standard uffd usage from the non-cooperative one.  Now more bits can be
      added to the features field indicating e.g.  UFFD_FEATURE_FORK and others
      needed for the latter use-case.
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f602d27
    • Andrea Arcangeli's avatar
      userfaultfd: add new syscall to provide memory externalization · 86039bd3
      Andrea Arcangeli authored
      Once an userfaultfd has been created and certain region of the process
      virtual address space have been registered into it, the thread responsible
      for doing the memory externalization can manage the page faults in
      userland by talking to the kernel using the userfaultfd protocol.
      
      poll() can be used to know when there are new pending userfaults to be
      read (POLLIN).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86039bd3
    • Andrea Arcangeli's avatar
      userfaultfd: prevent khugepaged to merge if userfaultfd is armed · c1294d05
      Andrea Arcangeli authored
      If userfaultfd is armed on a certain vma we can't "fill" the holes with
      zeroes or we'll break the userland on demand paging.  The holes if the
      userfault is armed, are really missing information (not zeroes) that the
      userland has to load from network or elsewhere.
      
      The same issue happens for wrprotected ptes that we can't just convert
      into a single writable pmd_trans_huge.
      
      We could however in theory still merge across zeropages if only
      VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)...  that could be
      slightly improved but it'd be much more complex code for a tiny corner
      case.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1294d05
    • Andrea Arcangeli's avatar
      userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx · 19a809af
      Andrea Arcangeli authored
      vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
      must be aware about so that we can merge vmas back like they were
      originally before arming the userfaultfd on some memory range.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19a809af
    • Andrea Arcangeli's avatar
      userfaultfd: call handle_userfault() for userfaultfd_missing() faults · 6b251fc9
      Andrea Arcangeli authored
      This is where the page faults must be modified to call
      handle_userfault() if userfaultfd_missing() is true (so if the
      vma->vm_flags had VM_UFFD_MISSING set).
      
      handle_userfault() then takes care of blocking the page fault and
      delivering it to userland.
      
      The fault flags must also be passed as parameter so the "read|write"
      kind of fault can be passed to userland.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b251fc9
    • Andrea Arcangeli's avatar
      userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 16ba6f81
      Andrea Arcangeli authored
      These two flags gets set in vma->vm_flags to tell the VM common code
      if the userfaultfd is armed and in which mode (only tracking missing
      faults, only tracking wrprotect faults or both). If neither flags is
      set it means the userfaultfd is not armed on the vma.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16ba6f81
    • Andrea Arcangeli's avatar
      userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct · 745f234b
      Andrea Arcangeli authored
      This adds the vm_userfaultfd_ctx to the vm_area_struct.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      745f234b
    • Andrea Arcangeli's avatar
      userfaultfd: linux/userfaultfd_k.h · 932b18e0
      Andrea Arcangeli authored
      Kernel header defining the methods needed by the VM common code to
      interact with the userfaultfd.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      932b18e0
    • Andrea Arcangeli's avatar
      userfaultfd: uAPI · 1038628d
      Andrea Arcangeli authored
      Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1038628d
    • Andrea Arcangeli's avatar
      userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key · 51360155
      Andrea Arcangeli authored
      userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
      of the current hardcoded 1 (that would wake just the first waitqueue in
      the head list).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51360155
    • Andrea Arcangeli's avatar
      userfaultfd: linux/Documentation/vm/userfaultfd.txt · 25edd8bf
      Andrea Arcangeli authored
      This is the latest userfaultfd patchset.  The postcopy live migration
      feature on the qemu side is mostly ready to be merged and it entirely
      depends on the userfaultfd syscall to be merged as well.  So it'd be great
      if this patchset could be reviewed for merging in -mm.
      
      Userfaults allow to implement on demand paging from userland and more
      generally they allow userland to more efficiently take control of the
      behavior of page faults than what was available before (PROT_NONE +
      SIGSEGV trap).
      
      The use cases are:
      
      1) KVM postcopy live migration (one form of cloud memory
         externalization).
      
         KVM postcopy live migration is the primary driver of this work:
      
          http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
          http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html
      
      2) postcopy live migration of binaries inside linux containers:
      
          http://thread.gmane.org/gmane.linux.kernel.mm/132662
      
      3) KVM postcopy live snapshotting (allowing to limit/throttle the
         memory usage, unlike fork would, plus the avoidance of fork
         overhead in the first place).
      
         While the wrprotect tracking is not implemented yet, the syscall API is
         already contemplating the wrprotect fault tracking and it's generic enough
         to allow its later implementation in a backwards compatible fashion.
      
      4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
         should be extended to work also on tmpfs and then the
         uffdio_register.ioctls will notify userland that UFFDIO_COPY is
         available even when the registered virtual memory range is tmpfs
         backed.
      
      5) alternate mechanism to notify web browsers or apps on embedded
         devices that volatile pages have been reclaimed. This basically
         avoids the need to run a syscall before the app can access with the
         CPU the virtual regions marked volatile. This depends on point 4)
         to be fulfilled first, as volatile pages happily apply to tmpfs.
      
      Even though there wasn't a real use case requesting it yet, it also
      allows to implement distributed shared memory in a way that readonly
      shared mappings can exist simultaneously in different hosts and they
      can be become exclusive at the first wrprotect fault.
      
      This patch (of 22):
      
      Add documentation.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25edd8bf
    • Daniel Borkmann's avatar
      mm/slab.h: fix argument order in cache_from_obj's error message · 2d16e0fd
      Daniel Borkmann authored
      While debugging a networking issue, I hit a condition that triggered an
      object to be freed into the wrong kmem cache, and thus triggered the
      warning in cache_from_obj().
      
      The arguments in the error message are in wrong order: the location
      of the object's kmem cache is in cachep, not s.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d16e0fd
    • Joonsoo Kim's avatar
      mm/slub: don't wait for high-order page allocation · 45eb00cd
      Joonsoo Kim authored
      Description is almost copied from commit fb05e7a8 ("net: don't wait
      for order-3 page allocation").
      
      I saw excessive direct memory reclaim/compaction triggered by slub.  This
      causes performance issues and add latency.  Slub uses high-order
      allocation to reduce internal fragmentation and management overhead.  But,
      direct memory reclaim/compaction has high overhead and the benefit of
      high-order allocation can't compensate the overhead of both work.
      
      This patch makes auxiliary high-order allocation atomic.  If there is no
      memory pressure and memory isn't fragmented, the alloction will still
      success, so we don't sacrifice high-order allocation's benefit here.  If
      the atomic allocation fails, direct memory reclaim/compaction will not be
      triggered, allocation fallback to low-order immediately, hence the direct
      memory reclaim/compaction overhead is avoided.  In the allocation failure
      case, kswapd is waken up and trying to make high-order freepages, so
      allocation could success next time.
      
      Following is the test to measure effect of this patch.
      
      System: QEMU, CPU 8, 512 MB
      Mem: 25% memory is allocated at random position to make fragmentation.
       Memory-hogger occupies 150 MB memory.
      Workload: hackbench -g 20 -l 1000
      
      Average result by 10 runs (Base va Patched)
      
      elapsed_time(s): 4.3468 vs 2.9838
      compact_stall: 461.7 vs 73.6
      pgmigrate_success: 28315.9 vs 7256.1
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45eb00cd
    • Konstantin Khlebnikov's avatar
      mm/slub: fix slab double-free in case of duplicate sysfs filename · 80da026a
      Konstantin Khlebnikov authored
      sysfs_slab_add() shouldn't call kobject_put at error path: this puts last
      reference of kmem-cache kobject and frees it.  Kmem cache will be freed
      second time at error path in kmem_cache_create().
      
      For example this happens when slub debug was enabled in runtime and
      somebody creates new kmem cache:
      
      # echo 1 | tee /sys/kernel/slab/*/sanity_checks
      # modprobe configfs
      
      "configfs_dir_cache" cannot be merged because existing slab have debug and
      cannot create new slab because unique name ":t-0000096" already taken.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80da026a
    • Thomas Gleixner's avatar
      mm/slub: move slab initialization into irq enabled region · 588f8ba9
      Thomas Gleixner authored
      Initializing a new slab can introduce rather large latencies because most
      of the initialization runs always with interrupts disabled.
      
      There is no point in doing so.  The newly allocated slab is not visible
      yet, so there is no reason to protect it against concurrent alloc/free.
      
      Move the expensive parts of the initialization into allocate_slab(), so
      for all allocations with GFP_WAIT set, interrupts are enabled.
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      588f8ba9
    • Jesper Dangaard Brouer's avatar
      slub: add support for kmem_cache_debug in bulk calls · 3eed034d
      Jesper Dangaard Brouer authored
      Per request of Joonsoo Kim adding kmem debug support.
      
      I've tested that when debugging is disabled, then there is almost no
      performance impact as this code basically gets removed by the compiler.
      
      Need some guidance in enabling and testing this.
      
      bulk- PREVIOUS                  - THIS-PATCH
        1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
        2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
        3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
        4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
        8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
       16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
       30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
       32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
       34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
       48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
       64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
      128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
      158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
      250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3eed034d