    mm: make set_mempolicy(MPOL_INTERLEAVE) N_HIGH_MEMORY aware · 4bfc4495
    KAMEZAWA Hiroyuki authored
    Initially, init_task's mems_allowed is set as follows:
     init_task->mems_allowed = node_states[N_POSSIBLE]
    
    And the cpuset's top_cpuset mask is initialized as follows:
     top_cpuset->mems_allowed = node_states[N_HIGH_MEMORY]
    
    Before 2.6.29:
    The policy's mems_allowed is initialized as follows:
    
      1. update task->mems_allowed from its cpuset->mems_allowed.
      2. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)
    
    The task's mems_allowed is updated with reference to top_cpuset's mask,
    and a cpuset's mems_allowed is always aware of N_HIGH_MEMORY.
    
    In 2.6.30: After commit 58568d2a
    ("cpuset,mm: update tasks' mems_allowed in time"), the policy's
    mems_allowed is initialized as follows:
    
      1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)
    
    Here, if the task is in top_cpuset, task->mems_allowed is never updated
    from init_task's value.  Assume the user executes a command such as
    # numactl --interleave=all ...
    
      policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)
    
    Then the policy's mems_allowed can include a possible node that has no pgdat.
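
    The difference between the two initialization schemes can be sketched
    in user space as follows.  This is only a toy model: toy_nodemask, the
    function names and the 64-node limit are all made up for illustration
    and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one bit per node id. The kernel uses nodemask_t instead. */
typedef uint64_t toy_nodemask;

/* Hypothetical helper: mask with nodes 0..n-1 set. */
toy_nodemask toy_first_nodes(int n)
{
	assert(n >= 0 && n < 64);
	return ((toy_nodemask)1 << n) - 1;
}

/* Old scheme: refresh the task mask from its cpuset, then apply the user mask. */
toy_nodemask policy_mask_old(toy_nodemask cpuset_mems, toy_nodemask user)
{
	toy_nodemask task_mems = cpuset_mems;	/* step 1: copy from cpuset */

	return task_mems & user;		/* step 2: nodes_and() */
}

/* 2.6.30 scheme: the task mask is used as-is (N_POSSIBLE for top_cpuset tasks). */
toy_nodemask policy_mask_new(toy_nodemask task_mems, toy_nodemask user)
{
	return task_mems & user;
}
```

    With four possible nodes of which only two have memory, the old scheme
    yields a mask of the two populated nodes, while the 2.6.30 scheme keeps
    all four.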
    
    MPOL_INTERLEAVE simply scans the nodemask derived from
    task->mems_allowed and accesses each node directly:
    
      NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL
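
    The failure mode can be illustrated with a small user-space model.  The
    struct toy_pgdat type, the node_data[] array and TOY_MAX_NODES are
    assumptions standing in for pg_data_t, NODE_DATA() and MAX_NUMNODES;
    the function only reports which node an unguarded scan would
    dereference as NULL.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 8		/* hypothetical limit for this sketch */

struct toy_pgdat {
	int nr_zones;
};

/*
 * Return the first node id set in mask whose node data is NULL,
 * or -1 if every node in the mask has node data.
 */
int first_node_without_pgdat(unsigned int mask,
			     struct toy_pgdat *node_data[TOY_MAX_NODES])
{
	assert(node_data != NULL);

	for (int nid = 0; nid < TOY_MAX_NODES; nid++) {
		if ((mask & (1u << nid)) && node_data[nid] == NULL)
			return nid;
	}
	return -1;
}
```

    A mask built from possible nodes trips over the first unpopulated node,
    while a mask restricted to nodes with memory does not.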
    
    What we need, then, is to make policy->mems_allowed aware of
    N_HIGH_MEMORY.  This patch does that.  But doing so would put an extra
    nodemask on the stack.  Because cpumask already has the CPUMASK_ALLOC()
    interface, I added an equivalent for nodemask.
    
    This patch follows the old behavior.  I feel this fix itself is just a
    Band-Aid, but a fundamental fix has to take care of memory hotplug and
    that takes time.  (I think task->mems_allowed should be N_HIGH_MEMORY.)
    
    mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's
    nodemask should include only online nodes.
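
    A minimal sketch of that restriction, again as a user-space toy model
    rather than the actual mpol_set_nodemask() code.  The function name,
    the toy_nodemask type and the handling of an empty result are
    assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_nodemask;	/* one bit per node id */

/*
 * Restrict the user's mask to nodes that are allowed for the task and
 * that have memory.  Returns 0 and stores the result in *out, or -1 and
 * leaves *out untouched if no usable node remains (assumed behavior).
 */
int toy_set_policy_nodes(toy_nodemask task_allowed, toy_nodemask has_memory,
			 toy_nodemask user, toy_nodemask *out)
{
	toy_nodemask usable, result;

	assert(out != NULL);

	usable = task_allowed & has_memory;	/* drop memoryless nodes first */
	result = user & usable;			/* then apply the user's mask */
	if (result == 0)
		return -1;

	*out = result;
	return 0;
}
```

    With this ordering, an "all nodes" request from a task whose allowed
    mask is still N_POSSIBLE ends up with only the nodes that have memory.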
    
    In the old behavior, this was guaranteed by frequent calls into cpuset
    code.  Most of those calls have now been removed, so mempolicy has to
    check it by itself.
    
    This check needs a few nodemask_t variables for calculating the
    nodemask.  But a nodemask_t can be large, and it is not good to
    allocate them on the stack.
    
    cpumask_t now has CPUMASK_ALLOC/FREE, an easy way to get a scratch area.
    NODEMASK_ALLOC/FREE should exist as well.
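
    The idea behind such a pair of macros can be sketched in user space as
    follows.  The TOY_* names, the 256-node threshold and the use of
    malloc()/free() are assumptions for illustration; they are not the
    kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_NODES 1024	/* hypothetical; stands in for a large node count */
#define TOY_BITS_PER_LONG (8 * sizeof(unsigned long))
#define TOY_LONGS ((TOY_MAX_NODES + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

struct toy_nodemask {
	unsigned long bits[TOY_LONGS];
};

#if TOY_MAX_NODES > 256
/* Large masks: take the scratch area from the heap. */
#define TOY_NODEMASK_ALLOC(m) struct toy_nodemask *m = malloc(sizeof(*m))
#define TOY_NODEMASK_FREE(m)  free(m)
#else
/* Small masks: a stack variable is cheap enough. */
#define TOY_NODEMASK_ALLOC(m) struct toy_nodemask m##_storage, *m = &m##_storage
#define TOY_NODEMASK_FREE(m)  ((void)0)
#endif

/*
 * Count the nodes present in both a and b using one scratch mask.
 * Returns -1 if the scratch area cannot be allocated.
 */
int toy_count_common(const struct toy_nodemask *a, const struct toy_nodemask *b)
{
	int count = 0;

	assert(a != NULL && b != NULL);

	TOY_NODEMASK_ALLOC(scratch);
	if (!scratch)
		return -1;

	for (size_t i = 0; i < TOY_LONGS; i++)
		scratch->bits[i] = a->bits[i] & b->bits[i];

	/* Clear the lowest set bit until the word is empty. */
	for (size_t i = 0; i < TOY_LONGS; i++)
		for (unsigned long w = scratch->bits[i]; w; w &= w - 1)
			count++;

	TOY_NODEMASK_FREE(scratch);
	return count;
}
```

    The caller's code is the same in both configurations; only the macro
    definitions decide whether the scratch mask lives on the heap or on the
    stack.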
    
    [akpm@linux-foundation.org: cleanups & tweaks]
    Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Miao Xie <miaox@cn.fujitsu.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Paul Menage <menage@google.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
    Cc: Pekka Enberg <penberg@cs.helsinki.fi>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>