    mm/memory-failure.c: shift page lock from head page to tail page after thp split · 54b9dd14
    Naoya Horiguchi authored
    After the thp split in hwpoison_user_mappings(), we hold the page lock
    on the raw error page only around try_to_unmap(), so we are exposed to
    a race condition.
    
    I found in RHEL7 MCE-relay testing that we get a "bad page" error
    when a memory error happens on a thp tail page used by qemu-kvm:
    
      Triggering MCE exception on CPU 10
      mce: [Hardware Error]: Machine check events logged
      MCE exception done on CPU 10
      MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
      MCE 0x38c535: dirty LRU page recovery: Recovered
      qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
      BUG: Bad page state in process qemu-kvm  pfn:38c400
      page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
      page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
      Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
      CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
      Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
      Call Trace:
        dump_stack+0x19/0x1b
        bad_page.part.59+0xcf/0xe8
        free_pages_prepare+0x148/0x160
        free_hot_cold_page+0x31/0x140
        free_hot_cold_page_list+0x46/0xa0
        release_pages+0x1c1/0x200
        free_pages_and_swap_cache+0xad/0xd0
        tlb_flush_mmu.part.46+0x4c/0x90
        tlb_finish_mmu+0x55/0x60
        exit_mmap+0xcb/0x170
        mmput+0x67/0xf0
        vhost_dev_cleanup+0x231/0x260 [vhost_net]
        vhost_net_release+0x3f/0x90 [vhost_net]
        __fput+0xe9/0x270
        ____fput+0xe/0x10
        task_work_run+0xc4/0xe0
        do_exit+0x2bb/0xa40
        do_group_exit+0x3f/0xa0
        get_signal_to_deliver+0x1d0/0x6e0
        do_signal+0x48/0x5e0
        do_notify_resume+0x71/0xc0
        retint_signal+0x48/0x8c
    
    The cause of this bug is that a page fault happens before the head
    page is unlocked at the end of memory_failure().  This strange page
    fault tries to access address 0x20, and I'm not sure why qemu-kvm does
    this, but as a result the SIGSEGV makes qemu-kvm exit, and on the way
    out we hit the bad page bug/warning because we try to free a locked
    page (the former head page).
    
    To fix this, this patch shifts the page lock from the head page to the
    tail page just after the thp split.  SIGSEGV still happens, but it
    affects only the VMs hit by the error, not the whole system.
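    
    The lock hand-over described above can be modelled in user space.  The
    sketch below is not the kernel code: struct toy_page and all toy_*
    helpers are hypothetical stand-ins, and the "locked" flag only imitates
    PG_locked.  It illustrates the ordering only: lock the raw error page
    first, then unlock the former head page, so the former head page is no
    longer locked if it is freed before memory_failure() finishes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page (hypothetical, not the kernel type). */
struct toy_page {
    bool locked; /* imitates the PG_locked flag */
};

static void toy_lock_page(struct toy_page *page)
{
    assert(!page->locked); /* no contention in this single-threaded model */
    page->locked = true;
}

static void toy_unlock_page(struct toy_page *page)
{
    assert(page->locked);
    page->locked = false;
}

/*
 * Hand the lock over from the former head page to the raw error page
 * after the split. Returns the page the caller has to unlock later.
 */
static struct toy_page *toy_shift_lock(struct toy_page *hpage,
                                       struct toy_page *p)
{
    if (hpage == p)
        return hpage;       /* error hit the head page: nothing to shift */
    toy_lock_page(p);       /* take the lock on the raw error page first */
    toy_unlock_page(hpage); /* then release the former head page */
    return p;
}

/* Imitates the free-time check: freeing a locked page is a "bad page". */
static bool toy_free_is_bad(const struct toy_page *page)
{
    return page->locked;
}
```

    In this model, once toy_shift_lock() returns, freeing the former head
    page no longer trips the locked-page check, and the caller unlocks the
    returned page instead of the head page.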
    
    Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
    Cc: <stable@vger.kernel.org>        [3.9+] # a3e0f9e4 "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>