1. 01 Aug, 2012 40 commits
    • Tim Chen's avatar
      memcg: gix memory accounting scalability in shrink_page_list · 69980e31
      Tim Chen authored
      I noticed in a multi-process parallel files reading benchmark I ran on a 8
      socket machine, throughput slowed down by a factor of 8 when I ran the
      benchmark within a cgroup container.  I traced the problem to the
      following code path (see below) when we are trying to reclaim memory from
      file cache.  The res_counter_uncharge function is called on every page
      that's reclaimed and created heavy lock contention.  The patch below
      allows the reclaimed pages to be uncharged from the resource counter in
      batch and recovered the regression.
      
      Tim
      
           40.67%           usemem  [kernel.kallsyms]                   [k] _raw_spin_lock
                            |
                            --- _raw_spin_lock
                               |
                               |--92.61%-- res_counter_uncharge
                               |          |
                               |          |--100.00%-- __mem_cgroup_uncharge_common
                               |          |          |
                               |          |          |--100.00%-- mem_cgroup_uncharge_cache_page
                               |          |          |          __remove_mapping
                               |          |          |          shrink_page_list
                               |          |          |          shrink_inactive_list
                               |          |          |          shrink_mem_cgroup_zone
                               |          |          |          shrink_zone
                               |          |          |          do_try_to_free_pages
                               |          |          |          try_to_free_pages
                               |          |          |          __alloc_pages_nodemask
                               |          |          |          alloc_pages_current
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69980e31
    • Gavin Shan's avatar
      mm/sparse: remove index_init_lock · c1c95183
      Gavin Shan authored
      sparse_index_init() uses the index_init_lock spinlock to protect root
      mem_section assignment.  The lock is not necessary anymore because the
      function is called only during boot (during paging init which is executed
      only from a single CPU) and from the hotplug code (by add_memory() via
      arch_add_memory()) which uses mem_hotplug_mutex.
      
      The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
      preparation") and sparse_index_init() was used only during boot at that
      time.
      
      Later when the hotplug code (and add_memory()) was introduced there was no
      synchronization so it was possible to online more sections from the same
      root probably (though I am not 100% sure about that).  The first
      synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
      hibernation in the same kernel") which was later replaced by the
      mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
      {un}lock_memory_hotplug()").
      
      Let's remove the lock as it is not needed and it makes the code more
      confusing.
      
      [mhocko@suse.cz: changelog]
      Signed-off-by: default avatarGavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1c95183
    • Gavin Shan's avatar
      mm/sparse: more checks on mem_section number · db36a461
      Gavin Shan authored
      __section_nr() was implemented to retrieve the corresponding memory
      section number according to its descriptor.  It's possible that the
      specified memory section descriptor doesn't exist in the global array.  So
      add more checking on that and report an error for a wrong case.
      Signed-off-by: default avatarGavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db36a461
    • Gavin Shan's avatar
      mm/sparse: optimize sparse_index_alloc · 5b760e64
      Gavin Shan authored
      With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
      descriptors are allocated from slab or bootmem.  When allocating from
      slab, let slab/bootmem allocator clear the memory chunk.  We needn't clear
      it explicitly.
      Signed-off-by: default avatarGavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b760e64
    • Wanpeng Li's avatar
      memcg: add mem_cgroup_from_css() helper · b2145145
      Wanpeng Li authored
      Add a mem_cgroup_from_css() helper to replace open-coded invokations of
      container_of().  To clarify the code and to add a little more type safety.
      
      [akpm@linux-foundation.org: fix extensive breakage]
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2145145
    • Hugh Dickins's avatar
      memcg: further prevent OOM with too many dirty pages · c3b94f44
      Hugh Dickins authored
      The may_enter_fs test turns out to be too restrictive: though I saw no
      problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
      on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
      slightly changed the way I started off the testing: dd if=/dev/zero
      of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
      memory.limit_in_bytes cgroup to ext4 on USB stick.
      
      ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
      AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
      the transaction needs to be started even before allocating pagecache
      memory.  But it may not be worth worrying about these days: if direct
      reclaim avoids FS writeback, does __GFP_FS now mean anything?
      
      Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
      device; but since that also masks off __GFP_IO, we can test for __GFP_IO
      directly, ignoring may_enter_fs and __GFP_FS.
      
      But even so, the test still OOMs sometimes: when originally testing on
      3.5-rc6, it OOMed about one time in five or ten; when testing just now on
      3.5-rc6-mm1, it OOMed on the first iteration.
      
      This residual problem comes from an accumulation of pages under ordinary
      writeback, not marked PageReclaim, so rightly not causing the memcg check
      to wait on their writeback: these too can prevent shrink_page_list() from
      freeing any pages, so many times that memcg reclaim fails and OOMs.
      
      Deal with these in the same way as direct reclaim now deals with dirty FS
      pages: mark them PageReclaim.  It is appropriate to rotate these to tail
      of list when writepage completes, but more importantly, the PageReclaim
      flag makes memcg reclaim wait on them if encountered again.  Increment
      NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
      
      Setting PageReclaim here may occasionally race with end_page_writeback()
      clearing it: lru_deactivate_fn() already faced the same race, and
      correctly concluded that the window is small and the issue non-critical.
      
      With these changes, the test runs indefinitely without OOMing on ext4,
      ext3 and ext2: I'll move on to test with other filesystems later.
      
      Trivia: invert conditions for a clearer block without an else, and goto
      keep_locked to do the unlock_page.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3b94f44
    • Michal Hocko's avatar
      memcg: prevent OOM with too many dirty pages · e62e384e
      Michal Hocko authored
      The current implementation of dirty pages throttling is not memcg aware
      which makes it easy to have memcg LRUs full of dirty pages.  Without
      throttling, these LRUs can be scanned faster than the rate of writeback,
      leading to memcg OOM conditions when the hard limit is small.
      
      This patch fixes the problem by throttling the allocating process
      (possibly a writer) during the hard limit reclaim by waiting on
      PageReclaim pages.  We are waiting only for PageReclaim pages because
      those are the pages that made one full round over LRU and that means that
      the writeback is much slower than scanning.
      
      The solution is far from being ideal - long term solution is memcg aware
      dirty throttling - but it is meant to be a band aid until we have a real
      fix.  We are seeing this happening during nightly backups which are placed
      into containers to prevent from eviction of the real working set.
      
      The change affects only memcg reclaim and only when we encounter
      PageReclaim pages which is a signal that the reclaim doesn't catch up on
      with the writers so somebody should be throttled.  This could be
      potentially unfair because it could be somebody else from the group who
      gets throttled on behalf of the writer but as writers need to allocate as
      well and they allocate in higher rate the probability that only innocent
      processes would be penalized is not that high.
      
      I have tested this change by a simple dd copying /dev/zero to tmpfs or
      ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
      containers) and dd got killed by OOM killer every time.  With the patch I
      could run the dd with the same size under 5M controller without any OOM.
      The issue is more visible with slower devices for output.
      
      * With the patch
      ================
      * tmpfs size=2G
      ---------------
      $ vim cgroup_cache_oom_test.sh
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
      
      * Without the patch
      ===================
      * tmpfs size=2G
      ---------------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      ./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
      
      [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
      [hughd@google.com: fix deadlock with loop driver]
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e62e384e
    • Xiao Guangrong's avatar
      mm: mmu_notifier: fix freed page still mapped in secondary MMU · 3ad3d901
      Xiao Guangrong authored
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                        mmu nofifler not found
                                  free page  !!!!!!
                                  /*
                                   * At the point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
      Signed-off-by: default avatarXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ad3d901
    • Johannes Weiner's avatar
      mm: memcg: only check anon swapin page charges for swap cache · bdf4f4d2
      Johannes Weiner authored
      shmem knows for sure that the page is in swap cache when attempting to
      charge a page, because the cache charge entry function has a check for it.
      Only anon pages may be removed from swap cache already when trying to
      charge their swapin.
      
      Adjust the comment, though: '4969c119 mm: fix swapin race condition' added
      a stable PageSwapCache check under the page lock in the do_swap_page()
      before calling the memory controller, so it's unuse_pte()'s pte_same()
      that may fail.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bdf4f4d2
    • Johannes Weiner's avatar
      mm: memcg: only check swap cache pages for repeated charging · 90deb788
      Johannes Weiner authored
      Only anon and shmem pages in the swap cache are attempted to be charged
      multiple times, from every swap pte fault or from shmem_unuse().  No other
      pages require checking PageCgroupUsed().
      
      Charging pages in the swap cache is also serialized by the page lock, and
      since both the try_charge and commit_charge are called under the same page
      lock section, the PageCgroupUsed() check might as well happen before the
      counter charging, let alone reclaim.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90deb788
    • Johannes Weiner's avatar
      mm: memcg: split swapin charge function into private and public part · 0435a2fd
      Johannes Weiner authored
      When shmem is charged upon swapin, it does not need to check twice whether
      the memory controller is enabled.
      
      Also, shmem pages do not have to be checked for everything that regular
      anon pages have to be checked for, so let shmem use the internal version
      directly and allow future patches to move around checks that are only
      required when swapping in anon pages.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0435a2fd
    • Johannes Weiner's avatar
      mm: memcg: remove needless !mm fixup to init_mm when charging · 24467cac
      Johannes Weiner authored
      It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
      or init_mm, it will charge the root memcg in either case.
      
      Also fix up the comment in __mem_cgroup_try_charge() that claimed the
      init_mm would be charged when no mm was passed.  It's not really
      incorrect, but confusing.  Clarify that the root memcg is charged in this
      case.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24467cac
    • Johannes Weiner's avatar
      mm: memcg: remove unneeded shmem charge type · 62ba7442
      Johannes Weiner authored
      shmem page charges have not needed a separate charge type to tell them
      from regular file pages since 08e552c6 ("memcg: synchronized LRU").
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62ba7442
    • Johannes Weiner's avatar
      mm: memcg: move swapin charge functions above callsites · 827a03d2
      Johannes Weiner authored
      Charging cache pages may require swapin in the shmem case.  Save the
      forward declaration and just move the swapin functions above the cache
      charging functions.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      827a03d2
    • Johannes Weiner's avatar
      mm: memcg: only check for PageSwapCache when uncharging anon · 7d188958
      Johannes Weiner authored
      Only anon pages that are uncharged at the time of the last page table
      mapping vanishing may be in swapcache.
      
      When shmem pages, file pages, swap-freed anon pages, or just migrated
      pages are uncharged, they are known for sure to be not in swapcache.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d188958
    • Johannes Weiner's avatar
      mm: memcg: push down PageSwapCache check into uncharge entry functions · 0c59b89c
      Johannes Weiner authored
      Not all uncharge paths need to check if the page is swapcache, some of
      them can know for sure.
      
      Push down the check into all callsites of uncharge_common() so that the
      patch that removes some of them is more obvious.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c59b89c
    • Johannes Weiner's avatar
      mm: swapfile: clean up unuse_pte race handling · 5d84c776
      Johannes Weiner authored
      The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
      the function would continue to reestablish the page even after
      mem_cgroup_try_charge_swapin() failed.  After 85d9fc89 "memcg: fix refcnt
      handling at swapoff", the condition is always true when this code is
      reached.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d84c776
    • Johannes Weiner's avatar
      mm: memcg: fix compaction/migration failing due to memcg limits · 0030f535
      Johannes Weiner authored
      Compaction (and page migration in general) can currently be hindered
      through pages being owned by memory cgroups that are at their limits and
      unreclaimable.
      
      The reason is that the replacement page is being charged against the limit
      while the page being replaced is also still charged.  But this seems
      unnecessary, given that only one of the two pages will still be in use
      after migration finishes.
      
      This patch changes the memcg migration sequence so that the replacement
      page is not charged.  Whatever page is still in use after successful or
      failed migration gets to keep the charge of the page that was going to be
      replaced.
      
      The replacement page will still show up temporarily in the rss/cache
      statistics, this can be fixed in a later patch as it's less urgent.
      Reported-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0030f535
    • Mel Gorman's avatar
      swapfile: avoid dereferencing bd_disk during swap_entry_free for network storage · 73744923
      Mel Gorman authored
      Commit b3a27d ("swap: Add swap slot free callback to
      block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
      dereference if using swap-over-NFS.  This patch checks SWP_BLKDEV on the
      swap_info_struct before dereferencing.
      
      With reference to this callback, Christoph Hellwig stated "Please just
      remove the callback entirely.  It has no user outside the staging tree and
      was added clearly against the rules for that staging tree".  This would
      also be my preference but there was not an obvious way of keeping zram in
      staging/ happy.
      Signed-off-by: default avatarXiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73744923
    • Mel Gorman's avatar
      nfs: prevent page allocator recursions with swap over NFS. · 192e501b
      Mel Gorman authored
      GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
      just not of any filesystem data.
      
      The problem is that previously NOFS was correct because that avoids
      recursion into the NFS code.  With swap-over-NFS, it is no longer correct
      as swap IO can lead to this recursion.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      192e501b
    • Mel Gorman's avatar
      nfs: enable swap on NFS · a564b8f0
      Mel Gorman authored
      Implement the new swapfile a_ops for NFS and hook up ->direct_IO.  This
      will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
      PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
      ->connect() method.
      
      PF_MEMALLOC should allow the allocation of struct socket and related
      objects and the early (re)setting of SOCK_MEMALLOC should allow us to
      receive the packets required for the TCP connection buildup.
      
      [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
      [dfeng@redhat.com: Fix handling of multiple swap files]
      [a.p.zijlstra@chello.nl: Original patch]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a564b8f0
    • Mel Gorman's avatar
      nfs: disable data cache revalidation for swapfiles · 29418aa4
      Mel Gorman authored
      The VM does not like PG_private set on PG_swapcache pages.  As suggested
      by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables NFS
      data cache revalidation on swap files.  as it does not make sense to have
      other clients change the file while it is being used as swap.  This avoids
      setting PG_private on swap pages, since there ought to be no further races
      with invalidate_inode_pages2() to deal with.
      
      Since we cannot set PG_private we cannot use page->private which is
      already used by PG_swapcache pages to store the nfs_page.  Thus augment
      the new nfs_page_find_request logic.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29418aa4
    • Mel Gorman's avatar
      nfs: teach the NFS client how to treat PG_swapcache pages · d56b4ddf
      Mel Gorman authored
      Replace all relevant occurences of page->index and page->mapping in the
      NFS client with the new page_file_index() and page_file_mapping()
      functions.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d56b4ddf
    • Mel Gorman's avatar
      mm: add support for direct_IO to highmem pages · 5a178119
      Mel Gorman authored
      The patch "mm: add support for a filesystem to activate swap files and use
      direct_IO for writing swap pages" added support for using direct_IO to
      write swap pages but it is insufficient for highmem pages.
      
      To support highmem pages, this patch kmaps() the page before calling the
      direct_IO() handler.  As direct_IO deals with virtual addresses an
      additional helper is necessary for get_kernel_pages() to lookup the struct
      page for a kmap virtual address.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a178119
    • Mel Gorman's avatar
      mm: swap: implement generic handler for swap_activate · a509bc1a
      Mel Gorman authored
      The version of swap_activate introduced is sufficient for swap-over-NFS
      but would not provide enough information to implement a generic handler.
      This patch shuffles things slightly to ensure the same information is
      available for aops->swap_activate() as is available to the core.
      
      No functionality change.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a509bc1a
    • Mel Gorman's avatar
      mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages · 62c230bc
      Mel Gorman authored
      Currently swapfiles are managed entirely by the core VM by using ->bmap to
      allocate space and write to the blocks directly.  This effectively ensures
      that the underlying blocks are allocated and avoids the need for the swap
      subsystem to locate what physical blocks store offsets within a file.
      
      If the swap subsystem is to use the filesystem information to locate the
      blocks, it is critical that information such as block groups, block
      bitmaps and the block descriptor table that map the swap file were
      resident in memory.  This patch adds address_space_operations that the VM
      can call when activating or deactivating swap backed by a file.
      
        int swap_activate(struct file *);
        int swap_deactivate(struct file *);
      
      The ->swap_activate() method is used to communicate to the file that the
      VM relies on it, and the address_space should take adequate measures such
      as reserving space in the underlying device, reserving memory for mempools
      and pinning information such as the block descriptor table in memory.  The
      ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
      returned success.
      
      After a successful swapfile ->swap_activate, the swapfile is marked
      SWP_FILE and swapper_space.a_ops will proxy to
      sis->swap_file->f_mappings->a_ops using ->direct_io to write swapcache
      pages and ->readpage to read.
      
      It is perfectly possible that direct_IO be used to read the swap pages but
      it is an unnecessary complication.  Similarly, it is possible that
      ->writepage be used instead of direct_io to write the pages but filesystem
      developers have stated that calling writepage from the VM is undesirable
      for a variety of reasons and using direct_IO opens up the possibility of
      writing back batches of swap pages in the future.
      
      [a.p.zijlstra@chello.nl: Original patch]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62c230bc
    • Mel Gorman's avatar
      mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Mel Gorman authored
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18022c5d
    • Mel Gorman's avatar
      mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Mel Gorman authored
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f981c595
    • Mel Gorman's avatar
      selinux: tag avc cache alloc as non-critical · 6290c2c4
      Mel Gorman authored
      Failing to allocate a cache entry will only harm performance not
      correctness.  Do not consume valuable reserve pages for something like
      that.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6290c2c4
    • Mel Gorman's avatar
      netvm: prevent a stream-specific deadlock · c76562b6
      Mel Gorman authored
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition and file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option so if swap if
      required then swapping over the network is considered.  The two likely
      scenarios are when blade servers are used as part of a cluster where the
      form factor or maintenance costs do not allow the use of disks and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information to what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete than if the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.  Once
      this change it applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
      If this happens, a warning is generated and the tokens reclaimed to avoid
      accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c76562b6
    • Mel Gorman's avatar
      mm: account for the number of times direct reclaimers get throttled · 68243e76
      Mel Gorman authored
      Under significant pressure when writing back to network-backed storage,
      direct reclaimers may get throttled.  This is expected to be a short-lived
      event and the processes get woken up again but processes do get stalled.
      This patch counts how many times such stalling occurs.  It's up to the
      administrator whether to reduce these stalls by increasing
      min_free_kbytes.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68243e76
    • Mel Gorman's avatar
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is... · 5515061d
      Mel Gorman authored
      mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
      
      If swap is backed by network storage such as NBD, there is a risk that a
      large number of reclaimers can hang the system by consuming all
      PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
      min_free_kbytes in advance which is a bit fragile.
      
      This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
      are in use.  If the system is routinely getting throttled the system
      administrator can increase min_free_kbytes so degradation is smoother but
      the system will keep running.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5515061d
    • Mel Gorman's avatar
      nbd: set SOCK_MEMALLOC for access to PFMEMALLOC reserves · 7f338fe4
      Mel Gorman authored
      Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC reserves
      so pages backed by NBD, particularly if swap related, can be cleaned to
      prevent the machine being deadlocked.  It is still possible that the
      PFMEMALLOC reserves get depleted resulting in deadlock but this can be
      resolved by the administrator by increasing min_free_kbytes.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f338fe4
    • Mel Gorman's avatar
      mm: micro-optimise slab to avoid a function call · 381760ea
      Mel Gorman authored
      Getting and putting objects in SLAB currently requires a function call but
      the bulk of the work is related to PFMEMALLOC reserves which are only
      consumed when network-backed storage is critical.  Use an inline function
      to determine if the function call is required.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      381760ea
    • Mel Gorman's avatar
      netvm: set PF_MEMALLOC as appropriate during SKB processing · b4b9e355
      Mel Gorman authored
      In order to make sure pfmemalloc packets receive all memory needed to
      proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
      This is limited to a subset of protocols that are expected to be used for
      writing to swap.  Taps are not allowed to use PF_MEMALLOC as these are
      expected to communicate with userspace processes which could be paged out.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [jslaby@suse.cz: Lock imbalance fix]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4b9e355
    • Mel Gorman's avatar
      netvm: propagate page->pfmemalloc from skb_alloc_page to skb · 0614002b
      Mel Gorman authored
      The skb->pfmemalloc flag gets set to true iff during the slab allocation
      of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If
      page splitting is used, it is possible that pages will be allocated from
      the PFMEMALLOC reserve without propagating this information to the skb.
      This patch propagates page->pfmemalloc from pages allocated for fragments
      to the skb.
      
      It works by reintroducing and expanding the skb_alloc_page() API to take
      an skb.  If the page was allocated from pfmemalloc reserves, it is
      automatically copied.  If the driver allocates the page before the skb, it
      should call skb_propagate_pfmemalloc() after the skb is allocated to
      ensure the flag is copied properly.
      
      Failure to do so is not critical.  The resulting driver may perform slower
      if it is used for swap-over-NBD or swap-over-NFS but it should not result
      in failure.
      
      [davem@davemloft.net: API rename and consistency]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0614002b
    • Mel Gorman's avatar
      netvm: propagate page->pfmemalloc to skb · c48a11c7
      Mel Gorman authored
      The skb->pfmemalloc flag gets set to true iff during the slab allocation
      of data in __alloc_skb that the the PFMEMALLOC reserves were used.  If the
      packet is fragmented, it is possible that pages will be allocated from the
      PFMEMALLOC reserve without propagating this information to the skb.  This
      patch propagates page->pfmemalloc from pages allocated for fragments to
      the skb.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c48a11c7
    • Mel Gorman's avatar
      netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Mel Gorman authored
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c93bdd0e
    • Mel Gorman's avatar
      netvm: allow the use of __GFP_MEMALLOC by specific sockets · 7cb02404
      Mel Gorman authored
      Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
      for their allocations.  These sockets will be able to go below watermarks
      and allocate from the emergency reserve.  Such sockets are to be used to
      service the VM (iow.  to swap over).  They must be handled kernel side,
      exposing such a socket to user-space is a bug.
      
      There is a risk that the reserves be depleted so for now, the
      administrator is responsible for increasing min_free_kbytes as necessary
      to prevent deadlock for their workloads.
      
      [a.p.zijlstra@chello.nl: Original patches]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7cb02404
    • Mel Gorman's avatar
      net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket · 99a1dec7
      Mel Gorman authored
      Introduce sk_gfp_atomic(), this function allows to inject sock specific
      flags to each sock related allocation.  It is only used on allocation
      paths that may be required for writing pages back to network storage.
      
      [davem@davemloft.net: Use sk_gfp_atomic only when necessary]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99a1dec7