1. 13 Oct, 2014 17 commits
  2. 01 Oct, 2014 1 commit
  3. 26 Sep, 2014 22 commits
    • Julian Anastasov's avatar
      ipvs: fix ipv6 hook registration for local replies · f68c8836
      Julian Anastasov authored
      commit eb90b0c7 upstream.
      
      commit fc604767
      ("ipvs: changes for local real server") from 2.6.37
      introduced DNAT support to local real server but the
      IPv6 LOCAL_OUT handler ip_vs_local_reply6() is
      registered incorrectly as IPv4 hook causing any outgoing
      IPv4 traffic to be dropped depending on the IP header values.
      
      Chris tracked down the problem to CONFIG_IP_VS_IPV6=y
      Bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349768Reported-by: default avatarChris J Arges <chris.j.arges@canonical.com>
      Tested-by: default avatarChris J Arges <chris.j.arges@canonical.com>
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f68c8836
    • Alex Gartrell's avatar
      ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding · 1369e886
      Alex Gartrell authored
      commit 76f084bc upstream.
      
      Previously, only the four high bits of the tclass were maintained in the
      ipv6 case.  This matches the behavior of ipv4, though whether or not we
      should reflect ECN bits may be up for debate.
      Signed-off-by: default avatarAlex Gartrell <agartrell@fb.com>
      Acked-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1369e886
    • Julian Anastasov's avatar
      ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack · 46b97394
      Julian Anastasov authored
      commit 2627b7e1 upstream.
      
      commit 8f4e0a18 ("IPVS netns exit causes crash in conntrack")
      added second ip_vs_conn_drop_conntrack call instead of just adding
      the needed check. As result, the first call still can cause
      crash on netns exit. Remove it.
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarHans Schillstrom <hans@schillstrom.com>
      Signed-off-by: default avatarSimon Horman <horms@verge.net.au>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      46b97394
    • Michal Hocko's avatar
      mm: new_vma_page() cannot see NULL vma for hugetlb pages · 60f09ace
      Michal Hocko authored
      commit cc81717e upstream.
      
      Commit 11c731e8 ("mm/mempolicy: fix !vma in new_vma_page()") has
      removed BUG_ON(!vma) from new_vma_page which is partially correct
      because page_address_in_vma will return EFAULT for non-linear mappings
      and at least shared shmem might be mapped this way.
      
      The patch also tried to prevent NULL ptr for hugetlb pages which is not
      correct AFAICS because hugetlb pages cannot be mapped as VM_NONLINEAR
      and other conditions in page_address_in_vma seem to be legit and catch
      real bugs.
      
      This patch restores BUG_ON for PageHuge to catch potential issues when
      the to-be-migrated page is not setup properly.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      60f09ace
    • Wanpeng Li's avatar
      mm/mempolicy: fix !vma in new_vma_page() · 7d97f39b
      Wanpeng Li authored
      commit 11c731e8 upstream.
      
      BUG_ON(!vma) assumption is introduced by commit 0bf598d8 ("mbind:
      add BUG_ON(!vma) in new_vma_page()"), however, even if
      
          address = __vma_address(page, vma);
      
      and
      
          vma->start < address < vma->end
      
      page_address_in_vma() may still return -EFAULT because of many other
      conditions in it.  As a result the while loop in new_vma_page() may end
      with vma=NULL.
      
      This patch revert the commit and also fix the potential dereference NULL
      pointer reported by Dan.
      
         http://marc.info/?l=linux-mm&m=137689530323257&w=2
      
        kernel BUG at mm/mempolicy.c:1204!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU: 3 PID: 7056 Comm: trinity-child3 Not tainted 3.13.0-rc3+ #2
        task: ffff8801ca5295d0 ti: ffff88005ab20000 task.ti: ffff88005ab20000
        RIP: new_vma_page+0x70/0x90
        RSP: 0000:ffff88005ab21db0  EFLAGS: 00010246
        RAX: fffffffffffffff2 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000008040075 RSI: ffff8801c3d74600 RDI: ffffea00079a8b80
        RBP: ffff88005ab21dc8 R08: 0000000000000004 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff2
        R13: ffffea00079a8b80 R14: 0000000000400000 R15: 0000000000400000
      
        FS:  00007ff49c6f4740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007ff49c68f994 CR3: 000000005a205000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Stack:
         ffffea00079a8b80 ffffea00079a8bc0 ffffea00079a8ba0 ffff88005ab21e50
         ffffffff811adc7a 0000000000000000 ffff8801ca5295d0 0000000464e224f8
         0000000000000000 0000000000000002 0000000000000000 ffff88020ce75c00
        Call Trace:
          migrate_pages+0x12a/0x850
          SYSC_mbind+0x513/0x6a0
          SyS_mbind+0xe/0x10
          ia32_do_call+0x13/0x13
        Code: 85 c0 75 2f 4c 89 e1 48 89 da 31 f6 bf da 00 02 00 65 44 8b 04 25 08 f7 1c 00 e8 ec fd ff ff 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e6 48 89 df ba 01 00 00 00 e8 48
        RIP  [<ffffffff8119f200>] new_vma_page+0x70/0x90
         RSP <ffff88005ab21db0>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7d97f39b
    • Jiri Slaby's avatar
      Linux 3.12.30 · b0807bc1
      Jiri Slaby authored
      b0807bc1
    • Mel Gorman's avatar
      mm: page_alloc: reduce cost of the fair zone allocation policy · 99ed1bd0
      Mel Gorman authored
      commit 4ffeaf35 upstream.
      
      The fair zone allocation policy round-robins allocations between zones
      within a node to avoid age inversion problems during reclaim.  If the
      first allocation fails, the batch counts are reset and a second attempt
      made before entering the slow path.
      
      One assumption made with this scheme is that batches expire at roughly
      the same time and the resets each time are justified.  This assumption
      does not hold when zones reach their low watermark as the batches will
      be consumed at uneven rates.  Allocation failure due to watermark
      depletion result in additional zonelist scans for the reset and another
      watermark check before hitting the slowpath.
      
      On UMA, the benefit is negligible -- around 0.25%.  On 4-socket NUMA
      machine it's variable due to the variability of measuring overhead with
      the vmstat changes.  The system CPU overhead comparison looks like
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanilla   vmstat-v5 lowercost-v5
      User          746.94      774.56      802.00
      System      65336.22    32847.27    40852.33
      Elapsed     27553.52    27415.04    27368.46
      
      However it is worth noting that the overall benchmark still completed
      faster and intuitively it makes sense to take as few passes as possible
      through the zonelists.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      99ed1bd0
    • Mel Gorman's avatar
      mm: page_alloc: abort fair zone allocation policy when remotes nodes are encountered · 51aad0a5
      Mel Gorman authored
      commit f7b5d647 upstream.
      
      The purpose of numa_zonelist_order=zone is to preserve lower zones for
      use with 32-bit devices.  If locality is preferred then the
      numa_zonelist_order=node policy should be used.
      
      Unfortunately, the fair zone allocation policy overrides this by
      skipping zones on remote nodes until the lower one is found.  While this
      makes sense from a page aging and performance perspective, it breaks the
      expected zonelist policy.  This patch restores the expected behaviour
      for zone-list ordering.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      51aad0a5
    • Mel Gorman's avatar
      mm: vmscan: only update per-cpu thresholds for online CPU · fc915114
      Mel Gorman authored
      commit bb0b6dff upstream.
      
      When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
      to get more accurate counts to avoid breaching watermarks.  This
      threshold update iterates over all possible CPUs which is unnecessary.
      Only online CPUs need to be updated.  If a new CPU is onlined,
      refresh_zone_stat_thresholds() will set the thresholds correctly.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fc915114
    • Mel Gorman's avatar
      mm: move zone->pages_scanned into a vmstat counter · 4a4ede23
      Mel Gorman authored
      commit 0d5d823a upstream.
      
      zone->pages_scanned is a write-intensive cache line during page reclaim
      and it's also updated during page free.  Move the counter into vmstat to
      take advantage of the per-cpu updates and do not update it in the free
      paths unless necessary.
      
      On a small UMA machine running tiobench the difference is marginal.  On
      a 4-node machine the overhead is more noticable.  Note that automatic
      NUMA balancing was disabled for this test as otherwise the system CPU
      overhead is unpredictable.
      
                3.16.0-rc3  3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5   vmstat-v5
      User          746.94      759.78      774.56
      System      65336.22    58350.98    32847.27
      Elapsed     27553.52    27282.02    27415.04
      
      Note that the overhead reduction will vary depending on where exactly
      pages are allocated and freed.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4a4ede23
    • Mel Gorman's avatar
      mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines · b4fc580f
      Mel Gorman authored
      commit 3484b2de upstream.
      
      The arrangement of struct zone has changed over time and now it has
      reached the point where there is some inappropriate sharing going on.
      On x86-64 for example
      
      o The zone->node field is shared with the zone lock and zone->node is
        accessed frequently from the page allocator due to the fair zone
        allocation policy.
      
      o span_seqlock is almost never used by shares a line with free_area
      
      o Some zone statistics share a cache line with the LRU lock so
        reclaim-intensive and allocator-intensive workloads can bounce the cache
        line on a stat update
      
      This patch rearranges struct zone to put read-only and read-mostly
      fields together and then splits the page allocator intensive fields, the
      zone statistics and the page reclaim intensive fields into their own
      cache lines.  Note that the type of lowmem_reserve changes due to the
      watermark calculations being signed and avoiding a signed/unsigned
      conversion there.
      
      On the test configuration I used the overall size of struct zone shrunk
      by one cache line.  On smaller machines, this is not likely to be
      noticable.  However, on a 4-node NUMA machine running tiobench the
      system CPU overhead is reduced by this patch.
      
                3.16.0-rc3  3.16.0-rc3
                   vanillarearrange-v5r9
      User          746.94      759.78
      System      65336.22    58350.98
      Elapsed     27553.52    27282.02
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b4fc580f
    • Mel Gorman's avatar
      mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated · 59d4b113
      Mel Gorman authored
      commit 24b7e581 upstream.
      
      This was formerly the series "Improve sequential read throughput" which
      noted some major differences in performance of tiobench since 3.0.
      While there are a number of factors, two that dominated were the
      introduction of the fair zone allocation policy and changes to CFQ.
      
      The behaviour of fair zone allocation policy makes more sense than
      tiobench as a benchmark and CFQ defaults were not changed due to
      insufficient benchmarking.
      
      This series is what's left.  It's one functional fix to the fair zone
      allocation policy when used on NUMA machines and a reduction of overhead
      in general.  tiobench was used for the comparison despite its flaws as
      an IO benchmark as in this case we are primarily interested in the
      overhead of page allocator and page reclaim activity.
      
      On UMA, it makes little difference to overhead
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          383.61      386.77
      System        403.83      401.74
      Elapsed      5411.50     5413.11
      
      On a 4-socket NUMA machine it's a bit more noticable
      
                3.16.0-rc3   3.16.0-rc3
                   vanilla lowercost-v5
      User          746.94      802.00
      System      65336.22    40852.33
      Elapsed     27553.52    27368.46
      
      This patch (of 6):
      
      The LRU insertion and activate tracepoints take PFN as a parameter
      forcing the overhead to the caller.  Move the overhead to the tracepoint
      fast-assign method to ensure the cost is only incurred when the
      tracepoint is active.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      59d4b113
    • Jerome Marchand's avatar
      memcg, vmscan: Fix forced scan of anonymous pages · 04bab05a
      Jerome Marchand authored
      commit 2ab051e1 upstream.
      
      When memory cgoups are enabled, the code that decides to force to scan
      anonymous pages in get_scan_count() compares global values (free,
      high_watermark) to a value that is restricted to a memory cgroup (file).
      It make the code over-eager to force anon scan.
      
      For instance, it will force anon scan when scanning a memcg that is
      mainly populated by anonymous page, even when there is plenty of file
      pages to get rid of in others memcgs, even when swappiness == 0.  It
      breaks user's expectation about swappiness and hurts performance.
      
      This patch makes sure that forced anon scan only happens when there not
      enough file pages for the all zone, not just in one random memcg.
      
      [hannes@cmpxchg.org: cleanups]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      04bab05a
    • Joonsoo Kim's avatar
      vmalloc: use rcu list iterator to reduce vmap_area_lock contention · 788a2f69
      Joonsoo Kim authored
      commit 474750ab upstream.
      
      Richard Yao reported a month ago that his system have a trouble with
      vmap_area_lock contention during performance analysis by /proc/meminfo.
      Andrew asked why his analysis checks /proc/meminfo stressfully, but he
      didn't answer it.
      
        https://lkml.org/lkml/2014/4/10/416
      
      Although I'm not sure that this is right usage or not, there is a
      solution reducing vmap_area_lock contention with no side-effect.  That
      is just to use rcu list iterator in get_vmalloc_info().
      
      rcu can be used in this function because all RCU protocol is already
      respected by writers, since Nick Piggin commit db64fe02 ("mm:
      rewrite vmap layer") back in linux-2.6.28
      
      Specifically :
         insertions use list_add_rcu(),
         deletions use list_del_rcu() and kfree_rcu().
      
      Note the rb tree is not used from rcu reader (it would not be safe),
      only the vmap_area_list has full RCU protection.
      
      Note that __purge_vmap_area_lazy() already uses this rcu protection.
      
              rcu_read_lock();
              list_for_each_entry_rcu(va, &vmap_area_list, list) {
                      if (va->flags & VM_LAZY_FREE) {
                              if (va->va_start < *start)
                                      *start = va->va_start;
                              if (va->va_end > *end)
                                      *end = va->va_end;
                              nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                              list_add_tail(&va->purge_list, &valist);
                              va->flags |= VM_LAZY_FREEING;
                              va->flags &= ~VM_LAZY_FREE;
                      }
              }
              rcu_read_unlock();
      
      Peter:
      
      : While rcu list traversal over the vmap_area_list is safe, this may
      : arrive at different results than the spinlocked version. The rcu list
      : traversal version will not be a 'snapshot' of a single, valid instant
      : of the entire vmap_area_list, but rather a potential amalgam of
      : different list states.
      
      Joonsoo:
      
      : Yes, you are right, but I don't think that we should be strict here.
      : Meminfo is already not a 'snapshot' at specific time.  While we try to get
      : certain stats, the other stats can change.  And, although we may arrive at
      : different results than the spinlocked version, the difference would not be
      : large and would not make serious side-effect.
      
      [edumazet@google.com: add more commit description]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: default avatarRichard Yao <ryao@gentoo.org>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      788a2f69
    • Jerome Marchand's avatar
      mm: make copy_pte_range static again · 84a6c769
      Jerome Marchand authored
      commit 21bda264 upstream.
      
      Commit 71e3aac0 ("thp: transparent hugepage core") adds
      copy_pte_range prototype to huge_mm.h.  I'm not sure why (or if) this
      function have been used outside of memory.c, but it currently isn't.
      This patch makes copy_pte_range() static again.
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      84a6c769
    • David Rientjes's avatar
      mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode · 6a01f8dd
      David Rientjes authored
      commit 14a4e214 upstream.
      
      Commit 9f1b868a ("mm: thp: khugepaged: add policy for finding target
      node") improved the previous khugepaged logic which allocated a
      transparent hugepages from the node of the first page being collapsed.
      
      However, it is still possible to collapse pages to remote memory which
      may suffer from additional access latency.  With the current policy, it
      is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
      remotely if the majority are allocated from that node.
      
      When zone_reclaim_mode is enabled, it means the VM should make every
      attempt to allocate locally to prevent NUMA performance degradation.  In
      this case, we do not want to collapse hugepages to remote nodes that
      would suffer from increased access latency.  Thus, when
      zone_reclaim_mode is enabled, only allow collapsing to nodes with
      RECLAIM_DISTANCE or less.
      
      There is no functional change for systems that disable
      zone_reclaim_mode.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6a01f8dd
    • Hugh Dickins's avatar
      mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault() · 1d088486
      Hugh Dickins authored
      commit c0d73261 upstream.
      
      Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
      orig_pte upon which all subsequent decisions and pte_same() tests will
      be made.
      
      I have no evidence that its lack is responsible for the mm/filemap.c:202
      BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
      trinity, and I am not optimistic that it will fix it.  But I have found
      no other explanation, and ACCESS_ONCE() here will surely not hurt.
      
      If gcc does re-access the pte before passing it down, then that would be
      disastrous for correct page fault handling, and certainly could explain
      the page_mapped() BUGs seen (concurrent fault causing page to be mapped
      in a second time on top of itself: mapcount 2 for a single pte).
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1d088486
    • Hugh Dickins's avatar
      shmem: fix init_page_accessed use to stop !PageLRU bug · 335886a3
      Hugh Dickins authored
      commit 66d2f4d2 upstream.
      
      Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
      in isolate_lru_pages() at mm/vmscan.c:1281!
      
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") looks like interrupted work-in-progress.
      
      mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
      - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
      when the page is always visible in radix_tree, and often already on LRU.
      
      Revert change to shmem_write_begin(), and use init_page_accessed() or
      mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
      
      SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
      before; but since many other filesystems use [__]page_symlink(), which did
      and does mark the page accessed, consider this as rectifying an oversight.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      335886a3
    • Mel Gorman's avatar
      mm: avoid unnecessary atomic operations during end_page_writeback() · 9057c212
      Mel Gorman authored
      commit 888cf2db upstream.
      
      If a page is marked for immediate reclaim then it is moved to the tail of
      the LRU list.  This occurs when the system is under enough memory pressure
      for pages under writeback to reach the end of the LRU but we test for this
      using atomic operations on every writeback.  This patch uses an optimistic
      non-atomic test first.  It'll miss some pages in rare cases but the
      consequences are not severe enough to warrant such a penalty.
      
      While the function does not dominate profiles during a simple dd test the
      cost of it is reduced.
      
      73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
      23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9057c212
    • Mel Gorman's avatar
      mm: non-atomically mark page accessed during page cache allocation where possible · d618a27c
      Mel Gorman authored
      commit 2457aec6 upstream.
      
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary which is noticable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticable with fast storage.  The objective of the patch is to initialse
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this care are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail to the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  Then
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced where as previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaulate this is a simple dd of a large file done
      multiple times with the file deleted on each iterations.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      do to writback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: default avatarPrabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d618a27c
    • Mel Gorman's avatar
      fs: buffer: do not use unnecessary atomic operations when discarding buffers · 967e6428
      Mel Gorman authored
      commit e7470ee8 upstream.
      
      Discarding buffers uses a bunch of atomic operations when discarding
      buffers because ......  I can't think of a reason.  Use a cmpxchg loop to
      clear all the necessary flags.  In most (all?) cases this will be a single
      atomic operations.
      
      [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      967e6428
    • Mel Gorman's avatar
      mm: do not use unnecessary atomic operations when adding pages to the LRU · 6ffef5d8
      Mel Gorman authored
      commit 6fb81a17 upstream.
      
      When adding pages to the LRU we clear the active bit unconditionally.
      As the page could be reachable from other paths we cannot use unlocked
      operations without risk of corruption such as a parallel
      mark_page_accessed.  This patch tests if is necessary to clear the
      active flag before using an atomic operation.  This potentially opens a
      tiny race when PageActive is checked as mark_page_accessed could be
      called after PageActive was checked.  The race already exists but this
      patch changes it slightly.  The consequence is that that the page may be
      promoted to the active list that might have been left on the inactive
      list before the patch.  It's too tiny a race and too marginal a
      consequence to always use atomic operations for.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6ffef5d8