1. 24 Feb, 2013 40 commits
    • Yasuaki Ishimatsu's avatar
      x86: get pg_data_t's memory from other node · 4d59a751
      Yasuaki Ishimatsu authored
      During the implementation of SRAT support, we met a problem.  In
      setup_arch(), we have the following call series:
      
       1) memblock is ready;
       2) some functions use memblock to allocate memory;
       3) parse ACPI tables, such as SRAT.
      
      Before 3), we don't know which memory is hotpluggable, and as a result,
      we cannot prevent memblock from allocating hotpluggable memory.  So, in
      2), there could be some hotpluggable memory allocated by memblock.
      
      Now, we are trying to parse SRAT earlier, before memblock is ready.  But
      I think we need more investigation on this topic.  So in this v5, I
      dropped all the SRAT support, and v5 is just the same as v3, and it is
      based on 3.8-rc3.
      
      As we planned, we will support getting info from SRAT without users'
      participation at last.  And we will post another patch-set to do so.
      
      And also, I think for now, we can add this boot option as the first step
      of supporting movable node.  Since Linux cannot migrate the direct
      mapped pages, the only way for now is to limit the whole node containing
      only movable memory.
      
      Using SRAT is one way.  But even if we can use SRAT, users still need an
      interface to enable/disable this functionality if they don't want to
      loose their NUMA performance.  So I think, a user interface is always
      needed.
      
      For now, users can disable this functionality by not specifying the boot
      option.  Later, we will post SRAT support, and add another option value
      "movablecore_map=acpi" to using SRAT.
      
      This patch:
      
      If system can create movable node which all memory of the node is
      allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
      the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
      memblock_alloc_nid() to retry when the first allocation fails.
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d59a751
    • Tang Chen's avatar
      sched: do not use cpu_to_node() to find an offlined cpu's node. · aa00d89c
      Tang Chen authored
      If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
      will return -1.  As a result, cpumask_of_node(nid) will return NULL.  In
      this case, find_next_bit() in for_each_cpu will get a NULL pointer and
      cause panic.
      
      Here is a call trace:
        Call Trace:
         <IRQ>
          select_fallback_rq+0x71/0x190
          try_to_wake_up+0x2cb/0x2f0
          wake_up_process+0x15/0x20
          hrtimer_wakeup+0x22/0x30
          __run_hrtimer+0x83/0x320
          hrtimer_interrupt+0x106/0x280
          smp_apic_timer_interrupt+0x69/0x99
          apic_timer_interrupt+0x6f/0x80
      
      There is a hrtimer process sleeping, whose cpu has already been
      offlined.  When it is waken up, it tries to find another cpu to run, and
      get a -1 nid.  As a result, cpumask_of_node(-1) returns NULL, and causes
      ernel panic.
      
      This patch fixes this problem by judging if the nid is -1.  If nid is
      not -1, a cpu on the same node will be picked.  Else, a online cpu on
      another node will be picked.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aa00d89c
    • Wen Congyang's avatar
      cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node · e13fe869
      Wen Congyang authored
      When the node is offlined, there is no memory/cpu on the node.  If a
      sleep task runs on a cpu of this node, it will be migrated to the cpu on
      the other node.  So we can clear cpu-to-node mapping.
      
      [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e13fe869
    • Wen Congyang's avatar
      cpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu · 76bba142
      Wen Congyang authored
      The node will be offlined when all memory/cpu on the node is hotremoved.
      So we should try offline the node when hotremoving a cpu on the node.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76bba142
    • Wen Congyang's avatar
      memory-hotplug: export the function try_offline_node() · 90b30cdc
      Wen Congyang authored
      try_offline_node() will be needed in the tristate
      drivers/acpi/processor_driver.c.
      
      The node will be offlined when all memory/cpu on the node have been
      hotremoved.  So we need the function try_offline_node() in cpu-hotplug
      path.
      
      If the memory-hotplug is disabled, and cpu-hotplug is enabled
      
      1. no memory no the node
         we don't online the node, and cpu's node is the nearest node.
      
      2. the node contains some memory
         the node has been onlined, and cpu's node is still needed
         to migrate the sleep task on the cpu to the same node.
      
      So we do nothing in try_offline_node() in this case.
      
      [rientjes@google.com: export the function try_offline_node() fix]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90b30cdc
    • Wen Congyang's avatar
      cpu_hotplug: clear apicid to node when the cpu is hotremoved · c4c60524
      Wen Congyang authored
      When a cpu is hotpluged, we call acpi_map_cpu2node() in
      _acpi_map_lsapic() to store the cpu's node and apicid's node.  But we
      don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
      hotremoved.  If the node is also hotremoved, we will get the following
      messages:
      
        kernel BUG at include/linux/gfp.h:329!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
        RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
        RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
        R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
        FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
        Call Trace:
          new_slab+0x30/0x1b0
          __slab_alloc+0x358/0x4c0
          kmem_cache_alloc_node_trace+0xb4/0x1e0
          alloc_fair_sched_group+0xd0/0x1b0
          sched_create_group+0x3e/0x110
          sched_autogroup_create_attach+0x4d/0x180
          sys_setsid+0xd4/0xf0
          system_call_fastpath+0x16/0x1b
        Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
        RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
         RSP <ffff88078a049cf8>
        ---[ end trace adf84c90f3fea3e5 ]---
      
      The reason is that the cpu's node is not NUMA_NO_NODE, we will call
      alloc_pages_exact_node() to alloc memory on the node, but the node is
      offlined.
      
      If the node is onlined, we still need cpu's node.  For example: a task
      on the cpu is sleeped when the cpu is hotremoved.  We will choose
      another cpu to run this task when it is waked up.  If we know the cpu's
      node, we will choose the cpu on the same node first.  So we should clear
      cpu-to-node mapping when the node is offlined.
      
      This patch only clears apicid-to-node mapping when the cpu is
      hotremoved.
      
      [akpm@linux-foundation.org: fix section error]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4c60524
    • Lai Jiangshan's avatar
      mempolicy: fix is_valid_nodemask() · d3eb1570
      Lai Jiangshan authored
      is_valid_nodemask() was introduced by commit 19770b32 ("mm: filter
      based on a nodemask as well as a gfp_mask").  but it does not match its
      comments, because it does not check the zone which > policy_zone.
      
      Also in commit b377fd39 ("Apply memory policies to top two highest
      zones when highest zone is ZONE_MOVABLE"), this commits told us, if
      highest zone is ZONE_MOVABLE, we should also apply memory policies to
      it.  so ZONE_MOVABLE should be valid zone for policies.
      is_valid_nodemask() need to be changed to match it.
      
      Fix: check all zones, even its zoneid > policy_zone.  Use
      nodes_intersects() instead open code to check it.
      Reported-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3eb1570
    • Wen Congyang's avatar
      memory-hotplug: consider compound pages when free memmap · 8a356ce3
      Wen Congyang authored
      usemap could also be allocated as compound pages.  Should also consider
      compound pages when freeing memmap.
      
      If we don't fix it, there could be problems when we free vmemmap
      pagetables which are stored in compound pages.  The old pagetables will
      not be freed properly, and when we add the memory again, no new
      pagetable will be created.  And the old pagetable entry is used, than
      the kernel will panic.
      
      The call trace is like the following:
      
        BUG: unable to handle kernel paging request at ffffea0040000000
        IP: [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
        PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
        Oops: 0002 [#1] SMP
        Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        CPU 0
        Pid: 4, comm: kworker/0:0 Tainted: G        W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff816a483f>]  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
        RSP: 0018:ffff8807bdcb35d8  EFLAGS: 00010006
        RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
        RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
        RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
        R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
        R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
        FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
        Call Trace:
          __add_pages+0x85/0x120
          arch_add_memory+0x71/0xf0
          add_memory+0xd6/0x1f0
          acpi_memory_device_add+0x170/0x20c
          acpi_device_probe+0x50/0x18a
          really_probe+0x6c/0x320
          driver_probe_device+0x47/0xa0
          __device_attach+0x53/0x60
          bus_for_each_drv+0x6c/0xa0
          device_attach+0xa8/0xc0
          bus_probe_device+0xb0/0xe0
          device_add+0x301/0x570
          device_register+0x1e/0x30
          acpi_device_register+0x1d8/0x27c
          acpi_add_single_object+0x1df/0x2b9
          acpi_bus_check_add+0x112/0x18f
          acpi_ns_walk_namespace+0x105/0x255
          acpi_walk_namespace+0xcf/0x118
          acpi_bus_scan+0x5b/0x7c
          acpi_bus_add+0x2a/0x2c
          container_notify_cb+0x112/0x1a9
          acpi_ev_notify_dispatch+0x46/0x61
          acpi_os_execute_deferred+0x27/0x34
          process_one_work+0x20e/0x5c0
          worker_thread+0x12e/0x370
          kthread+0xee/0x100
          ret_from_fork+0x7c/0xb0
        Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 <f3> aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
        RIP  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
         RSP <ffff8807bdcb35d8>
        CR2: ffffea0040000000
        ---[ end trace e7f94e3a34c442d4 ]---
        Kernel panic - not syncing: Fatal exception
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8a356ce3
    • Tang Chen's avatar
      memory-hotplug: do not allocate pgdat if it was not freed when offline. · a1e565aa
      Tang Chen authored
      Since there is no way to guarentee the address of pgdat/zone is not on
      stack of any kernel threads or used by other kernel objects without
      reference counting or other symchronizing method, we cannot reset
      node_data and free pgdat when offlining a node.  Just reset pgdat to 0
      and reuse the memory when the node is online again.
      
      The problem is suggested by Kamezawa Hiroyuki.  The idea is from Wen
      Congyang.
      
      NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
            will be triggered.
      
      [akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
      [akpm@linux-foundation.org: fix the warning again again]
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1e565aa
    • Wen Congyang's avatar
      memory-hotplug: free node_data when a node is offlined · d822b86a
      Wen Congyang authored
      We call hotadd_new_pgdat() to allocate memory to store node_data.  So we
      should free it when removing a node.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d822b86a
    • Tang Chen's avatar
      memory-hotplug: remove sysfs file of node · 60a5a19e
      Tang Chen authored
      Introduce a new function try_offline_node() to remove sysfs file of node
      when all memory sections of this node are removed.  If some memory
      sections of this node are not removed, this function does nothing.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60a5a19e
    • Yasuaki Ishimatsu's avatar
      memory_hotplug: clear zone when removing the memory · 815121d2
      Yasuaki Ishimatsu authored
      When memory is added, we update zone's and pgdat's start_pfn and
      spanned_pages in __add_zone().  So we should revert them when the memory
      is removed.
      
      The patch adds a new function __remove_zone() to do this.
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      815121d2
    • Tang Chen's avatar
      memory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP. · 5fc1d66a
      Tang Chen authored
      Currently __remove_section for SPARSEMEM_VMEMMAP does nothing.  But even
      if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fc1d66a
    • Tang Chen's avatar
      memory-hotplug: remove memmap of sparse-vmemmap · 0197518c
      Tang Chen authored
      Introduce a new API vmemmap_free() to free and remove vmemmap
      pagetables.  Since pagetable implements are different, each architecture
      has to provide its own version of vmemmap_free(), just like
      vmemmap_populate().
      
      Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
      
      [mhocko@suse.cz: fix implicit declaration of remove_pagetable]
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0197518c
    • Tang Chen's avatar
      memory-hotplug: remove page table of x86_64 architecture · bbcab878
      Tang Chen authored
      Search a page table about the removed memory, and clear page table for
      x86_64 architecture.
      
      [akpm@linux-foundation.org: make kernel_physical_mapping_remove() static]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbcab878
    • Wen Congyang's avatar
      memory-hotplug: common APIs to support page tables hot-remove · ae9aae9e
      Wen Congyang authored
      When memory is removed, the corresponding pagetables should alse be
      removed.  This patch introduces some common APIs to support vmemmap
      pagetable and x86_64 architecture direct mapping pagetable removing.
      
      All pages of virtual mapping in removed memory cannot be freed if some
      pages used as PGD/PUD include not only removed memory but also other
      memory.  So this patch uses the following way to check whether a page
      can be freed or not.
      
      1) When removing memory, the page structs of the removed memory are
         filled with 0FD.
      
      2) All page structs are filled with 0xFD on PT/PMD, PT/PMD can be
         cleared.  In this case, the page used as PT/PMD can be freed.
      
      For direct mapping pages, update direct_pages_count[level] when we freed
      their pagetables.  And do not free the pages again because they were
      freed when offlining.
      
      For vmemmap pages, free the pages and their pagetables.
      
      For larger pages, do not split them into smaller ones because there is
      no way to know if the larger page has been split.  As a result, there is
      no way to decide when to split.  We deal the larger pages in the
      following way:
      
      1) For direct mapped pages, all the pages were freed when they were
         offlined.  And since menmory offline is done section by section, all
         the memory ranges being removed are aligned to PAGE_SIZE.  So only need
         to deal with unaligned pages when freeing vmemmap pages.
      
      2) For vmemmap pages being used to store page_struct, if part of the
         larger page is still in use, just fill the unused part with 0xFD.  And
         when the whole page is fulfilled with 0xFD, then free the larger page.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables]
      [tangchen@cn.fujitsu.com: do not free direct mapping pages twice]
      [tangchen@cn.fujitsu.com: do not free page split from hugepage one by one]
      [tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages]
      [akpm@linux-foundation.org: use pmd_page_vaddr()]
      [akpm@linux-foundation.org: fix used-uninitialised bug]
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae9aae9e
    • Tang Chen's avatar
      memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() · cd099682
      Tang Chen authored
      In __remove_section(), we locked pgdat_resize_lock when calling
      sparse_remove_one_section().  This lock will disable irq.  But we don't
      need to lock the whole function.  If we do some work to free pagetables
      in free_section_usemap(), we need to call flush_tlb_all(), which need
      irq enabled.  Otherwise the WARN_ON_ONCE() in smp_call_function_many()
      will be triggered.
      
      If we lock the whole sparse_remove_one_section(), then we come to this call trace:
      
        ------------[ cut here ]------------
        WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
        Hardware name: PRIMEQUEST 1800E
        ......
        Call Trace:
          smp_call_function_many+0xbd/0x260
          smp_call_function+0x3b/0x50
          on_each_cpu+0x3b/0xc0
          flush_tlb_all+0x1c/0x20
          remove_pagetable+0x14e/0x1d0
          vmemmap_free+0x18/0x20
          sparse_remove_one_section+0xf7/0x100
          __remove_section+0xa2/0xb0
          __remove_pages+0xa0/0xd0
          arch_remove_memory+0x6b/0xc0
          remove_memory+0xb8/0xf0
          acpi_memory_device_remove+0x53/0x96
          acpi_device_remove+0x90/0xb2
          __device_release_driver+0x7c/0xf0
          device_release_driver+0x2f/0x50
          acpi_bus_remove+0x32/0x6d
          acpi_bus_trim+0x91/0x102
          acpi_bus_hot_remove_device+0x88/0x16b
          acpi_os_execute_deferred+0x27/0x34
          process_one_work+0x20e/0x5c0
          worker_thread+0x12e/0x370
          kthread+0xee/0x100
          ret_from_fork+0x7c/0xb0
        ---[ end trace 25e85300f542aa01 ]---
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd099682
    • Yasuaki Ishimatsu's avatar
      memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap · 46723bfa
      Yasuaki Ishimatsu authored
      For removing memmap region of sparse-vmemmap which is allocated bootmem,
      memmap region of sparse-vmemmap needs to be registered by
      get_page_bootmem().  So the patch searches pages of virtual mapping and
      registers the pages by get_page_bootmem().
      
      NOTE: register_page_bootmem_memmap() is not implemented for ia64,
            ppc, s390, and sparc.  So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
            and revert register_page_bootmem_info_node() when platform doesn't
            support it.
      
            It's implemented by adding a new Kconfig option named
            CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
            by memory-hotplug feature fully supported archs(currently only on
            x86_64).
      
            Since we have 2 config options called MEMORY_HOTPLUG and
            MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
            and codes in function register_page_bootmem_info_node() are only
            used for collecting infomation for hot-remove, so reside it under
            MEMORY_HOTREMOVE.
      
            Besides page_isolation.c selected by MEMORY_ISOLATION under
            MEMORY_HOTPLUG is also such case, move it too.
      
      [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
      [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
      [mhocko@suse.cz: remove the arch specific functions without any implementation]
      [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
      [rientjes@google.com: fix defined but not used warning]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarWu Jianguo <wujianguo@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarLin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46723bfa
    • Wen Congyang's avatar
      memory-hotplug: introduce new arch_remove_memory() for removing page table · 24d335ca
      Wen Congyang authored
      For removing memory, we need to remove page tables.  But it depends on
      architecture.  So the patch introduce arch_remove_memory() for removing
      page table.  Now it only calls __remove_pages().
      
      Note: __remove_pages() for some archtecuture is not implemented
            (I don't know how to implement it for s390).
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24d335ca
    • Yasuaki Ishimatsu's avatar
      memory-hotplug: remove /sys/firmware/memmap/X sysfs · 46c66c4b
      Yasuaki Ishimatsu authored
      When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start,
      type} sysfs files are created.  But there is no code to remove these
      files.  This patch implements the function to remove them.
      
      We cannot free firmware_map_entry which is allocated by bootmem because
      there is no way to do so when the system is up.  But we can at least
      remember the address of that memory and reuse the storage when the
      memory is added next time.
      
      This patch also introduces a new list map_entries_bootmem to link the
      map entries allocated by bootmem when they are removed, and a lock to
      protect it.  And these entries will be reused when the memory is
      hot-added again.
      
      The idea is suggestted by Andrew Morton.
      
      NOTE: It is unsafe to return an entry pointer and release the
            map_entries_lock.  So we should not hold the map_entries_lock
            separately in firmware_map_find_entry() and
            firmware_map_remove_entry().  Hold the map_entries_lock across find
            and remove /sys/firmware/memmap/X operation.
      
             And also, users of these two functions need to be careful to
            hold the lock when using these two functions.
      
      [tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation]
      [tangchen@cn.fujitsu.com: fix the wrong comments of map_entries]
      [tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
      [tangchen@cn.fujitsu.com: fix section mismatch problem]
      [tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c]
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Julian Calaby <julian.calaby@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46c66c4b
    • Wen Congyang's avatar
      memory-hotplug: remove redundant codes · bbc76be6
      Wen Congyang authored
      offlining memory blocks and checking whether memory blocks are offlined
      are very similar.  This patch introduces a new function to remove
      redundant codes.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: default avatarKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbc76be6
    • Yasuaki Ishimatsu's avatar
      memory-hotplug: check whether all memory blocks are offlined or not when removing memory · 6677e3ea
      Yasuaki Ishimatsu authored
      We remove the memory like this:
      
       1. lock memory hotplug
       2. offline a memory block
       3. unlock memory hotplug
       4. repeat 1-3 to offline all memory blocks
       5. lock memory hotplug
       6. remove memory(TODO)
       7. unlock memory hotplug
      
      All memory blocks must be offlined before removing memory.  But we don't
      hold the lock in the whole operation.  So we should check whether all
      memory blocks are offlined before step6.  Otherwise, kernel maybe
      panicked.
      
      Offlining a memory block and removing a memory device can be two
      different operations.  Users can just offline some memory blocks without
      removing the memory device.  For this purpose, the kernel has held
      lock_memory_hotplug() in __offline_pages().  To reuse the code for
      memory hot-remove, we repeat step 1-3 to offline all the memory blocks,
      repeatedly lock and unlock memory hotplug, but not hold the memory
      hotplug lock in the whole operation.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6677e3ea
    • Wen Congyang's avatar
      memory-hotplug: try to offline the memory twice to avoid dependence · 993c1aad
      Wen Congyang authored
      memory can't be offlined when CONFIG_MEMCG is selected.  For example:
      there is a memory device on node 1.  The address range is [1G, 1.5G).
      You will find 4 new directories memory8, memory9, memory10, and memory11
      under the directory /sys/devices/system/memory/.
      
      If CONFIG_MEMCG is selected, we will allocate memory to store page
      cgroup when we online pages.  When we online memory8, the memory stored
      page cgroup is not provided by this memory device.  But when we online
      memory9, the memory stored page cgroup may be provided by memory8.  So
      we can't offline memory8 now.  We should offline the memory in the
      reversed order.
      
      When the memory device is hotremoved, we will auto offline memory
      provided by this memory device.  But we don't know which memory is
      onlined first, so offlining memory may fail.  In such case, iterate
      twice to offline the memory.  1st iterate: offline every non primary
      memory block.  2nd iterate: offline primary (i.e.  first added) memory
      block.
      
      This idea is suggested by KOSAKI Motohiro.
      Signed-off-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      993c1aad
    • Sasha Levin's avatar
      mm: memory_hotplug: no need to check res twice in add_memory · a864b9d0
      Sasha Levin authored
      Remove one redundant check of res.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a864b9d0
    • Michel Lespinasse's avatar
      mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool · 41badc15
      Michel Lespinasse authored
      do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
      multiple, however there was no equivalent code in mm_populate(), which
      caused issues.
      
      This could be fixed by introduced the same rounding in mm_populate(),
      however I think it's preferable to make do_mmap_pgoff() return populate
      as a size rather than as a boolean, so we don't have to duplicate the
      size rounding logic in mm_populate().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41badc15
    • Michel Lespinasse's avatar
      mm: introduce VM_POPULATE flag to better deal with racy userspace programs · 18693050
      Michel Lespinasse authored
      The vm_populate() code populates user mappings without constantly
      holding the mmap_sem.  This makes it susceptible to racy userspace
      programs: the user mappings may change while vm_populate() is running,
      and in this case vm_populate() may end up populating the new mapping
      instead of the old one.
      
      In order to reduce the possibility of userspace getting surprised by
      this behavior, this change introduces the VM_POPULATE vma flag which
      gets set on vmas we want vm_populate() to work on.  This way
      vm_populate() may still end up populating the new mapping after such a
      race, but only if the new mapping is also one that the user has
      requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18693050
    • Michel Lespinasse's avatar
      mm: directly use __mlock_vma_pages_range() in find_extend_vma() · cea10a19
      Michel Lespinasse authored
      In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
      the vma type - we know we're working with a stack.  So, we can call
      directly into __mlock_vma_pages_range(), and remove the last
      make_pages_present() call site.
      
      Note that we don't use mm_populate() here, so we can't release the
      mmap_sem while allocating new stack pages.  This is deemed acceptable,
      because the stack vmas grow by a bounded number of pages at a time, and
      these are anon pages so we don't have to read from disk to populate
      them.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cea10a19
    • Michel Lespinasse's avatar
      mm: remove flags argument to mmap_region · c22c0d63
      Michel Lespinasse authored
      After the MAP_POPULATE handling has been moved to mmap_region() call
      sites, the only remaining use of the flags argument is to pass the
      MAP_NORESERVE flag.  This can be just as easily handled by
      do_mmap_pgoff(), so do that and remove the mmap_region() flags
      parameter.
      
      [akpm@linux-foundation.org: remove double parens]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c22c0d63
    • Michel Lespinasse's avatar
      mm: use mm_populate() for mremap() of VM_LOCKED vmas · 81909b84
      Michel Lespinasse authored
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81909b84
    • Michel Lespinasse's avatar
      mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect · 128557ff
      Michel Lespinasse authored
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      128557ff
    • Michel Lespinasse's avatar
      mm: use mm_populate() for blocking remap_file_pages() · a1ea9549
      Michel Lespinasse authored
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1ea9549
    • Michel Lespinasse's avatar
      mm: introduce mm_populate() for populating new vmas · bebeb3d6
      Michel Lespinasse authored
      When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
      with MCL_FUTURE in effect), we want to populate the pages within the
      newly created vmas.  This may take a while as we may have to read pages
      from disk, so ideally we want to do this outside of the write-locked
      mmap_sem region.
      
      This change introduces mm_populate(), which is used to defer populating
      such mappings until after the mmap_sem write lock has been released.
      This is implemented as a generalization of the former do_mlock_pages(),
      which accomplished the same task but was using during mlock() /
      mlockall().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bebeb3d6
    • Michel Lespinasse's avatar
      mm: remap_file_pages() fixes · 940e7da5
      Michel Lespinasse authored
      We have many vma manipulation functions that are fast in the typical
      case, but can optionally be instructed to populate an unbounded number
      of ptes within the region they work on:
      
       - mmap with MAP_POPULATE or MAP_LOCKED flags;
       - remap_file_pages() with MAP_NONBLOCK not set or when working on a
         VM_LOCKED vma;
       - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
         effect;
       - brk() when mlock(MCL_FUTURE) is in effect.
      
      Current code handles these pte operations locally, while the
      sourrounding code has to hold the mmap_sem write side since it's
      manipulating vmas.  This means we're doing an unbounded amount of pte
      population work with mmap_sem held, and this causes problems as Andy
      Lutomirski reported (we've hit this at Google as well, though it's not
      entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
      first place).
      
      I propose introducing a new mm_populate() function to do this pte
      population work after the mmap_sem has been released.  mm_populate()
      does need to acquire the mmap_sem read side, but critically, it doesn't
      need to hold it continuously for the entire duration of the operation -
      it can drop it whenever things take too long (such as when hitting disk
      for a file read) and re-acquire it later on.
      
      The following patches are included
      
      - Patches 1 fixes some issues I noticed while working on the existing code.
        If needed, they could potentially go in before the rest of the patches.
      
      - Patch 2 introduces the new mm_populate() function and changes
        mmap_region() call sites to use it after they drop mmap_sem. This is
        inspired from Andy Lutomirski's proposal and is built as an extension
        of the work I had previously done for mlock() and mlockall() around
        v2.6.38-rc1. I had tried doing something similar at the time but had
        given up as there were so many do_mmap() call sites; the recent cleanups
        by Linus and Viro are a tremendous help here.
      
      - Patches 3-5 convert some of the less-obvious places doing unbounded
        pte populates to the new mm_populate() mechanism.
      
      - Patches 6-7 are code cleanups that are made possible by the
        mm_populate() work. In particular, they remove more code than the
        entire patch series added, which should be a good thing :)
      
      - Patch 8 is optional to this entire series. It only helps to deal more
        nicely with racy userspace programs that might modify their mappings
        while we're trying to populate them. It adds a new VM_POPULATE flag
        on the mappings we do want to populate, so that if userspace replaces
        them with mappings it doesn't want populated, mm_populate() won't
        populate those replacement mappings.
      
      This patch:
      
      Assorted small fixes. The first two are quite small:
      
      - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
        within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
        Purely cosmetic.
      
      - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
        range, make sure we own the mmap_sem write lock around the
        munlock_vma_pages_range call as this manipulates the vma's vm_flags.
      
      Last fix requires a longer explanation. remap_file_pages() can do its work
      either through VM_NONLINEAR manipulation or by creating extra vmas.
      These two cases were inconsistent with each other (and ultimately, both wrong)
      as to exactly when did they fault in the newly mapped file pages:
      
      - In the VM_NONLINEAR case, new file pages would be populated if
        the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
        new file pages wouldn't be populated even if the vma is already
        marked as VM_LOCKED.
      
      - In the linear (emulated) case, the work is passed to the mmap_region()
        function which would populate the pages if the vma is marked as
        VM_LOCKED, and would not otherwise - regardless of the value of the
        MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
        mmap_region().
      
      The desired behavior is that we want the pages to be populated and locked
      if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
      flag is not passed to remap_file_pages().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      940e7da5
    • Zlatko Calusic's avatar
      mm: avoid calling pgdat_balanced() needlessly · dafcb73e
      Zlatko Calusic authored
      Now that balance_pgdat() is slightly tidied up, thanks to more capable
      pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
      check the status, then break the loop if pgdat is balanced, just to be
      immediately called again.  The second call is completely unnecessary, of
      course.
      
      The patch introduces pgdat_is_balanced boolean, which helps resolve the
      above suboptimal behavior, with the added benefit of slightly better
      documenting one other place in the function where we jump and skip lots
      of code.
      Signed-off-by: default avatarZlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dafcb73e
    • Andrew Morton's avatar
      mm: compaction: make __compact_pgdat() and compact_pgdat() return void · 7103f16d
      Andrew Morton authored
      These functions always return 0.  Formalise this.
      
      Cc: Jason Liu <r64343@freescale.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7103f16d
    • Shaohua Li's avatar
      mm: make madvise(MADV_WILLNEED) support swap file prefetch · 1998cc04
      Shaohua Li authored
      Make madvise(MADV_WILLNEED) support swap file prefetch.  If memory is
      swapout, this syscall can do swapin prefetch.  It has no impact if the
      memory isn't swapout.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      [sasha.levin@oracle.com: fix BUG on madvise early failure]
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1998cc04
    • Michal Hocko's avatar
      memcg,vmscan: do not break out targeted reclaim without reclaimed pages · a394cb8e
      Michal Hocko authored
      Targeted (hard resp soft) reclaim has traditionally tried to scan one
      group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
      pages) is reclaimed or all priorities are exhausted.  The reclaim is
      then retried until the limit is met.
      
      This approach, however, doesn't work well with deeper hierarchies where
      groups higher in the hierarchy do not have any or only very few pages
      (this usually happens if those groups do not have any tasks and they
      have only re-parented pages after some of their children is removed).
      Those groups are reclaimed with decreasing priority pointlessly as there
      is nothing to reclaim from them.
      
      An easiest fix is to break out of the memcg iteration loop in
      shrink_zone only if the whole hierarchy has been visited or sufficient
      pages have been reclaimed.  This is also more natural because the
      reclaimer expects that the hierarchy under the given root is reclaimed.
      As a result we can simplify the soft limit reclaim which does its own
      iteration.
      
      [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
      [akpm@linux-foundation.org: use conventional comparison order]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reported-by: default avatarYing Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a394cb8e
    • Sasha Levin's avatar
      mm/ksm.c: use new hashtable implementation · 4ca3a69b
      Sasha Levin authored
      Switch ksm to use the new hashtable implementation.  This reduces the
      amount of generic unrelated code in the ksm module.
      Signed-off-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ca3a69b
    • Sasha Levin's avatar
      mm/huge_memory.c: use new hashtable implementation · 43b5fbbd
      Sasha Levin authored
      Switch hugemem to use the new hashtable implementation.  This reduces
      the amount of generic unrelated code in the hugemem.
      
      This also removes the dymanic allocation of the hash table.  The upside
      is that we save a pointer dereference when accessing the hashtable, but
      we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
      doesn't support hugepages.
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43b5fbbd
    • Mel Gorman's avatar
      mm: compaction: do not accidentally skip pageblocks in the migrate scanner · a9aacbcc
      Mel Gorman authored
      Compaction uses the ALIGN macro incorrectly with the migrate scanner by
      adding pageblock_nr_pages to a PFN.  It happened to work when initially
      implemented as the starting PFN was also aligned but with caching
      restarts and isolating in smaller chunks this is no longer always true.
      
      The impact is that the migrate scanner scans outside its current
      pageblock.  As pfn_valid() is still checked properly it does not cause
      any failure and the impact of the bug is that in some cases it will scan
      more than necessary when it crosses a page boundary but by no more than
      COMPACT_CLUSTER_MAX.  It is highly unlikely this is even measurable but
      it's still wrong so this patch addresses the problem.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9aacbcc