    x86, NUMA: Fix fakenuma boot failure · 7d6b4670
    KOSAKI Motohiro authored
    Currently, the numa=fake boot parameter is broken. If it is used,
    the kernel may panic due to a divide-by-zero error, depending on
    the CPU configuration.
    
    Call Trace:
     [<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
     [<ffffffff81086aff>] ? local_clock+0x6f/0x80
     [<ffffffff81050533>] load_balance+0xa3/0x600
     [<ffffffff81050f53>] idle_balance+0xf3/0x180
     [<ffffffff81550092>] schedule+0x722/0x7d0
     [<ffffffff81550538>] ? wait_for_common+0x128/0x190
     [<ffffffff81550a65>] schedule_timeout+0x265/0x320
     [<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
     [<ffffffff81550538>] ? wait_for_common+0x128/0x190
     [<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
     [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
     [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
     [<ffffffff81550540>] wait_for_common+0x130/0x190
     [<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
     [<ffffffff8155067d>] wait_for_completion+0x1d/0x20
     [<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
     [<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
     [<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
     [<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
     [<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
     [<ffffffff81df3d07>] kernel_init+0xc3/0x182
     [<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
     [<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
     [<ffffffff81df3c44>] ? start_kernel+0x400/0x400
     [<ffffffff8155d5e0>] ? gs_change+0x13/0x13
    
    The divide by zero is caused by the following line, which runs
    with group->cpu_power == 0:
    
     kernel/sched_fair.c::update_sg_lb_stats()
            /* Adjust by relative CPU power of the group */
            sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
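
    As a minimal user-space sketch of why that line faults, the same
    arithmetic can be reproduced outside the kernel. The names
    group_sketch and avg_load_sketch() are hypothetical, the value of
    SCHED_LOAD_SCALE is assumed to be 1024, and the zero check is for
    illustration only; it is not the fix applied by this patch.

```c
/* Assumed value of the scheduler's load scale for this sketch. */
#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical stand-in for the cpu_power field of a sched group. */
struct group_sketch {
	unsigned long cpu_power;
};

/*
 * Same arithmetic as the quoted line in update_sg_lb_stats().
 * Returns -1 instead of dividing when cpu_power is 0, which is the
 * case that makes the kernel panic. Returns 0 and stores the result
 * in *avg_load otherwise.
 */
static int avg_load_sketch(unsigned long group_load,
			   const struct group_sketch *group,
			   unsigned long *avg_load)
{
	if (group->cpu_power == 0)
		return -1;

	/* Adjust by relative CPU power of the group */
	*avg_load = (group_load * SCHED_LOAD_SCALE) / group->cpu_power;
	return 0;
}
```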
    
    This regression was caused by commit e23bba60 ("x86-64, NUMA: Unify
    emulated distance mapping") because it changes cpu -> node
    mapping in the process of dropping fake_physnodes().
    
      old) all cpus are assigned to node 0
      now) cpus are assigned round robin
           (the logic is implemented by numa_init_array())
    
      Note: The change in behavior only happens if the system has
            neither an ACPI SRAT table nor AMD northbridge NUMA
            information.
    
    Round robin assignment doesn't work because init_numa_sched_groups_power()
    assumes all logical cpus in the same physical cpu share the same node
    (then it only accounts for group_first_cpu()), and the simple round robin
    breaks the above assumption.
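
    The broken assumption can be illustrated with a simplified model.
    Everything below is hypothetical: the topology (two packages with
    two logical cpus each, two nodes), the helper names, and the
    per-package power of 1024. It only imitates the accounting rule
    described above (count power once per package, at the package's
    first cpu) and is not the kernel's code.

```c
#define NR_CPUS_SKETCH		4
#define NR_NODES_SKETCH		2
#define POWER_PER_PACKAGE	1024UL	/* assumed value for the sketch */

/* Hypothetical topology: cpus 0-1 in package 0, cpus 2-3 in package 1. */
static const int cpu_to_package[NR_CPUS_SKETCH] = { 0, 0, 1, 1 };

/* Lowest-numbered logical cpu in the given package, or -1 if none. */
static int first_cpu_of_package(int pkg)
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (cpu_to_package[cpu] == pkg)
			return cpu;
	return -1;
}

/* Simplified round robin: cpu N goes to node N modulo the node count. */
static void assign_round_robin(int cpu_to_node[NR_CPUS_SKETCH])
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		cpu_to_node[cpu] = cpu % NR_NODES_SKETCH;
}

/*
 * Sum the power of a node's group, adding each package's power only
 * when the package's first cpu belongs to the node.
 */
static unsigned long node_group_power(const int cpu_to_node[NR_CPUS_SKETCH],
				      int node)
{
	unsigned long power = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (cpu_to_node[cpu] != node)
			continue;
		if (cpu != first_cpu_of_package(cpu_to_package[cpu]))
			continue;	/* only once per physical package */
		power += POWER_PER_PACKAGE;
	}
	return power;
}
```

    In this model round robin places cpus 1 and 3 on node 1. Neither
    is the first cpu of its package, so node 1 ends up with a group
    power of 0, which is the value that later becomes the divisor.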
    
    Thus, this patch reassigns node ids if buggy firmware or NUMA
    emulation produces a wrong cpu -> node map. It enforces that all
    logical cpus in the same physical cpu share the same node.
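
    The idea of the reassignment can be sketched as follows. This is
    an illustration under assumptions, not the code added by the
    patch: the topology table and force_siblings_on_same_node() are
    hypothetical, and choosing the node of the lowest-numbered cpu in
    each package is an assumption made for the example.

```c
#define NR_CPUS_SKETCH	4

/* Hypothetical topology: cpus 0-1 in package 0, cpus 2-3 in package 1. */
static const int cpu_to_package[NR_CPUS_SKETCH] = { 0, 0, 1, 1 };

/*
 * Move every logical cpu to the node of the lowest-numbered cpu in
 * its physical package. Returns how many cpus were reassigned.
 */
static int force_siblings_on_same_node(int cpu_to_node[NR_CPUS_SKETCH])
{
	int fixed = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		for (int first = 0; first < cpu; first++) {
			if (cpu_to_package[first] != cpu_to_package[cpu])
				continue;
			/* 'first' is the lowest cpu in this package. */
			if (cpu_to_node[cpu] != cpu_to_node[first]) {
				cpu_to_node[cpu] = cpu_to_node[first];
				fixed++;
			}
			break;
		}
	}
	return fixed;
}
```

    With a map of { 0, 1, 1, 0 } the sketch moves cpu 1 to node 0 and
    cpu 3 to node 1, so each package ends up on a single node.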
    
    Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Yinghai Lu <yinghai@kernel.org>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Cc: Shaohui Zheng <shaohui.zheng@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: H. Peter Anvin <hpa@linux.intel.com>
    Link: http://lkml.kernel.org/r/20110415203928.1303.A69D9226@jp.fujitsu.com
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
smpboot.c 35.2 KB