Commit f8fd525b authored by Donet Tom, committed by Andrew Morton

mm/mempolicy: use numa_node_id() instead of cpu_to_node()

Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
policy", v4.

This patchset optimizes cross-socket memory access with the
MPOL_PREFERRED_MANY policy.

To test this patchset, we ran the following test on a 3-node system.
 Node 0 - 2GB   - Tier 1
 Node 1 - 11GB  - Tier 1
 Node 6 - 10GB  - Tier 2

The changes below were made to memcached to set the memory policy.
They select Node 0 and Node 1 as the preferred nodes.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <numaif.h>
    #include <numa.h>

    unsigned long nodemask;
    int ret;

    /* Bits 0 and 1 set: prefer Node 0 and Node 1 */
    nodemask = 0x03;
    ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
                        &nodemask, 10);
    /* If MPOL_F_NUMA_BALANCING isn't supported,
     * fall back to MPOL_PREFERRED_MANY */
    if (ret < 0 && errno == EINVAL) {
        printf("set mem policy normal\n");
        ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
    }
    if (ret < 0) {
        perror("Failed to call set_mempolicy");
        exit(-1);
    }
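
As a minimal sketch of how the 0x03 nodemask above maps to Node 0 and
Node 1: each node ID selects one bit of the mask. The helper name
nodemask_from_nodes() is hypothetical and is not part of memcached, libnuma,
or this patch; it only handles node IDs that fit in a single unsigned long.

```c
#include <stddef.h>

/*
 * Hypothetical helper (illustration only): build a single-word nodemask
 * with one bit set per listed node ID. IDs that do not fit in one
 * unsigned long are ignored.
 */
static unsigned long nodemask_from_nodes(const int *nodes, size_t count)
{
    const int bits = (int)(sizeof(unsigned long) * 8);
    unsigned long mask = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (nodes[i] >= 0 && nodes[i] < bits)
            mask |= 1UL << nodes[i];
    }
    return mask;
}
```

Passing the node list { 0, 1 } yields 0x03, the value used in the memcached
change above.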

Test Procedure:
===============
1. Make sure memory tiering and demotion are enabled.
2. Start memcached.

   # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7 \
       -d -s "/tmp/memcached.sock"

3. Run memtier_benchmark to store 3200000 keys.

   # ./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary \
       --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1 \
       --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024

4. Start a memory eater on node 0 and 1. This will demote all memcached
   pages to node 6.
5. Make sure all the memcached pages were demoted to the lower tier by
   reading /proc/<memcached PID>/numa_maps.

    # cat /proc/2771/numa_maps
     ---
    default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
    default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
     ---

6. Kill memory eater.
7. Read the pgpromote_success counter.
8. Start reading the keys by running memtier_benchmark.

   # ./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary \
       --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R \
       --key-minimum=1 --key-maximum=3200000 -n allkeys \
       --threads=64 -c 1 -R -x 6

9. Read the pgpromote_success counter.
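
Steps 7 and 9 can be scripted. The following is a minimal sketch, assuming
the per-node counter is exposed as a "name value" line in
/sys/devices/system/node/node<N>/vmstat; that path and both function names
are assumptions for illustration, not something this series adds.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 and store the value if `line` has the
 * form "<name> <value>", otherwise return 0 and leave *out untouched.
 */
static int match_counter(const char *line, const char *name, long long *out)
{
    char key[64];
    long long value;

    if (sscanf(line, "%63s %lld", key, &value) != 2)
        return 0;
    if (strcmp(key, name) != 0)
        return 0;
    *out = value;
    return 1;
}

/*
 * Read pgpromote_success for one node from the assumed sysfs path.
 * Returns -1 if the file cannot be opened or the counter is missing.
 */
long long node_pgpromote_success(int node)
{
    char path[64], line[128];
    long long value = -1;
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/vmstat", node);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (match_counter(line, "pgpromote_success", &value))
            break;
    }
    fclose(fp);
    return value;
}
```

Calling node_pgpromote_success() for each top-tier node before and after the
read workload, and subtracting the two readings, gives the number of pages
promoted to that node during the run.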

Test Results:
=============
Without Patch
------------------
1. pgpromote_success  before test
Node 0:  pgpromote_success 11
Node 1:  pgpromote_success 140974

pgpromote_success  after test
Node 0:  pgpromote_success 11
Node 1:  pgpromote_success 140974

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
==================================================================
Type    Ops/sec   Hits/sec   Misses/sec  Avg. Latency  p50 Latency
------------------------------------------------------------------
Sets     0.00       ---         ---        ---          ---
Gets    305792.03  305791.93   0.10       0.18949       0.16700
Waits    0.00       ---         ---        ---          ---
Totals  305792.03  305791.93   0.10       0.18949       0.16700

======================================
p99 Latency  p99.9 Latency  KB/sec
-------------------------------------
---          ---            0.00
0.44700     1.71100        11542.69
---           ---            ---
0.44700     1.71100        11542.69

With Patch
---------------
1. pgpromote_success  before test
Node 0:  pgpromote_success 5
Node 1:  pgpromote_success 89386

pgpromote_success  after test
Node 0:  pgpromote_success 57895
Node 1:  pgpromote_success 141463

2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
====================================================================
Type    Ops/sec    Hits/sec  Misses/sec  Avg. Latency  p50 Latency
--------------------------------------------------------------------
Sets     0.00        ---       ---        ---           ---
Gets    521942.24  521942.07  0.17       0.11459        0.10300
Waits    0.00        ---       ---         ---          ---
Totals  521942.24  521942.07  0.17       0.11459        0.10300

=======================================
p99 Latency  p99.9 Latency  KB/sec
---------------------------------------
 ---          ---            0.00
0.23100      0.31900        19701.68
---          ---             ---
0.23100      0.31900        19701.68


Test Result Analysis:
=====================
1. With the patch, pages are promoted; without it, the pgpromote_success
   counters do not change during the test.
2. The memtier_benchmark results show that, with the patch, throughput
   increased by about 70%.

 Ops/sec without fix -  305792.03
 Ops/sec with fix    -  521942.24
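
The figures above can be cross-checked with a few lines of arithmetic. This
is only a sketch; the helper names are hypothetical, and the inputs are the
counters and Ops/sec values reported in the test results.

```c
/* Pages promoted during the run: counter after minus counter before. */
static long promoted_pages(long before, long after)
{
    return after - before;
}

/* Relative change from `before` to `after`, in percent. */
static double percent_gain(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the patch, promoted_pages(5, 57895) gives 57890 pages for Node 0 and
promoted_pages(89386, 141463) gives 52077 pages for Node 1, while both
deltas are 0 without the patch. percent_gain(305792.03, 521942.24) comes to
roughly 70.7.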


This patch (of 2):

Use 'numa_node_id()', which is quicker, instead of 'cpu_to_node()'.
The value of smp_processor_id() is guaranteed to be stable in
'mpol_misplaced()' because the function is called with the ptl held.
A lockdep_assert_held() check was added to ensure that.

No functional change in this patch.

[donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
  Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent fea68a75
@@ -167,7 +167,8 @@ extern void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
 /* Check if a vma is migratable */
 extern bool vma_migratable(struct vm_area_struct *vma);
 
-int mpol_misplaced(struct folio *, struct vm_area_struct *, unsigned long);
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
+					unsigned long addr);
 extern void mpol_put_task_policy(struct task_struct *);
 
 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
@@ -282,7 +283,7 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
 #endif
 
 static inline int mpol_misplaced(struct folio *folio,
-				 struct vm_area_struct *vma,
+				 struct vm_fault *vmf,
 				 unsigned long address)
 {
 	return -1; /* no node preference */
......
@@ -1754,7 +1754,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	 */
 	if (node_is_toptier(nid))
 		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vma, haddr, nid, &flags);
+	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		folio_put(folio);
 		goto out_map;
......
@@ -1087,7 +1087,7 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
 
 void __vunmap_range_noflush(unsigned long start, unsigned long end);
 
-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int page_nid, int *flags);
 
 void free_zone_device_page(struct page *page);
......
@@ -5035,9 +5035,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
+int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int page_nid, int *flags)
 {
+	struct vm_area_struct *vma = vmf->vma;
+
 	folio_get(folio);
 
 	/* Record the current PID acceesing VMA */
@@ -5049,7 +5051,7 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(folio, vma, addr);
+	return mpol_misplaced(folio, vmf, addr);
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -5123,7 +5125,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		last_cpupid = (-1 & LAST_CPUPID_MASK);
 	else
 		last_cpupid = folio_last_cpupid(folio);
-	target_nid = numa_migrate_prep(folio, vma, vmf->address, nid, &flags);
+	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
 	if (target_nid == NUMA_NO_NODE) {
 		folio_put(folio);
 		goto out_map;
......
@@ -2718,7 +2718,7 @@ static void sp_free(struct sp_node *n)
  * mpol_misplaced - check whether current folio node is valid in policy
  *
  * @folio: folio to be checked
- * @vma: vm area where folio mapped
+ * @vmf: structure describing the fault
  * @addr: virtual address in @vma for shared policy lookup and interleave policy
  *
  * Lookup current policy node id for vma,addr and "compare to" folio's
@@ -2728,18 +2728,24 @@ static void sp_free(struct sp_node *n)
  * Return: NUMA_NO_NODE if the page is in a node that is valid for this
  * policy, or a suitable node ID to allocate a replacement folio from.
  */
-int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
+int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		   unsigned long addr)
 {
 	struct mempolicy *pol;
 	pgoff_t ilx;
 	struct zoneref *z;
 	int curnid = folio_nid(folio);
+	struct vm_area_struct *vma = vmf->vma;
 	int thiscpu = raw_smp_processor_id();
-	int thisnid = cpu_to_node(thiscpu);
+	int thisnid = numa_node_id();
 	int polnid = NUMA_NO_NODE;
 	int ret = NUMA_NO_NODE;
 
+	/*
+	 * Make sure ptl is held so that we don't preempt and we
+	 * have a stable smp processor id
+	 */
+	lockdep_assert_held(vmf->ptl);
 	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
@@ -2781,7 +2787,7 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		if (node_isset(curnid, pol->nodes))
 			goto out;
 		z = first_zones_zonelist(
-				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				node_zonelist(thisnid, GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
 				&pol->nodes);
 		polnid = zone_to_nid(z->zone);
......