• Yu Kuai's avatar
    md/raid5: avoid BUG_ON() while continue reshape after reassembling · 305a5170
    Yu Kuai authored
    Currently, mdadm support --revert-reshape to abort the reshape while
    reassembling, as the test 07revert-grow. However, following BUG_ON()
    can be triggerred by the test:
    
    kernel BUG at drivers/md/raid5.c:6278!
    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    irq event stamp: 158985
    CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94
    RIP: 0010:reshape_request+0x3f1/0xe60
    Call Trace:
     <TASK>
     raid5_sync_request+0x43d/0x550
     md_do_sync+0xb7a/0x2110
     md_thread+0x294/0x2b0
     kthread+0x147/0x1c0
     ret_from_fork+0x59/0x70
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    
    Root cause is that --revert-reshape update the raid_disks from 5 to 4,
    while reshape position is still set, and after reassembling the array,
    reshape position will be read from super block, then during reshape the
    checking of 'writepos' that is caculated by old reshape position will
    fail.
    
    Fix this panic the easy way first, by converting the BUG_ON() to
    WARN_ON(), and stop the reshape if checkings fail.
    
    Noted that mdadm must fix --revert-shape as well, and probably md/raid
    should enhance metadata validation as well, however this means
    reassemble will fail and there must be user tools to fix the wrong
    metadata.
    Signed-off-by: default avatarYu Kuai <yukuai3@huawei.com>
    Signed-off-by: default avatarSong Liu <song@kernel.org>
    Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com
    305a5170
raid5.c 253 KB