    md/cluster: block reshape with remote resync job · a8da01f7
    A reshape request should be blocked while a resync job is ongoing. In
    a cluster environment, a node can start a resync job even if the
    resync command wasn't executed on it: e.g., when the user runs
    "mdadm --grow" on node A, node B will sometimes start the resync.
    However, the current update_raid_disks() only checks the local
    recovery status, which is incomplete. As a result, "mdadm --grow"
    succeeds on the local node, while the remote node refuses the reshape
    because it is busy with the resync. This inconsistent handling leaves
    the array in an unexpected state. If the user doesn't notice the
    issue and keeps issuing mdadm commands, the array eventually stops
    working.
    
    Fix this issue by blocking the reshape request: when a node executes
    "--grow" and detects an ongoing resync, it should stop and report an
    error to the user.
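
    As a rough illustration only (this is not the kernel patch itself),
    the sketch below is a small standalone C model of the check that
    update_raid_disks() needs: refuse the reshape when any resync is
    running, whether it was started locally or by a remote cluster node.
    All struct and function names in the sketch are hypothetical.
    ```
    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical model of the relevant md state; in the kernel this
     * lives in mddev (recovery flags, reshape position, cluster info). */
    struct md_state {
    	bool local_resync_running;   /* resync/recovery runs on this node */
    	bool remote_resync_running;  /* a remote cluster node owns the resync */
    	bool reshape_in_progress;    /* a reshape is already under way */
    };

    /* Old behaviour (simplified): only local state is consulted, so a
     * reshape is wrongly allowed while a remote node resyncs the array. */
    static int reshape_allowed_old(const struct md_state *s)
    {
    	if (s->local_resync_running || s->reshape_in_progress)
    		return -1;	/* -EBUSY in the kernel */
    	return 0;
    }

    /* Fixed behaviour (simplified): a resync owned by any cluster node
     * blocks the reshape, so "mdadm --grow" fails with a clear error. */
    static int reshape_allowed_fixed(const struct md_state *s)
    {
    	if (s->local_resync_running || s->remote_resync_running ||
    	    s->reshape_in_progress)
    		return -1;	/* -EBUSY in the kernel */
    	return 0;
    }

    int main(void)
    {
    	/* Scenario from this commit: "mdadm --grow" runs on node A while
    	 * node B happens to own the resync job. */
    	struct md_state s = {
    		.local_resync_running  = false,
    		.remote_resync_running = true,
    		.reshape_in_progress   = false,
    	};

    	printf("old check  : %s\n", reshape_allowed_old(&s) ?
    	       "reshape blocked" : "reshape allowed (bug)");
    	printf("fixed check: %s\n", reshape_allowed_fixed(&s) ?
    	       "reshape blocked" : "reshape allowed (bug)");
    	return 0;
    }
    ```
    In the kernel the corresponding state is tracked in mddev, and a
    refused reshape typically surfaces as -EBUSY, which mdadm reports
    back to the user.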
    
    The following script reproduces the issue with ~100% probability.
    (Two nodes share 3 iSCSI LUNs: sdg/sdh/sdi; each LUN is 1GB.)
    ```
    # On node1; node2 is the remote node.
    ssh root@node2 "mdadm -S --scan"
    mdadm -S --scan

    # Wipe any stale metadata on the shared LUNs.
    for i in {g,h,i}; do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
    count=20; done

    # Create a clustered RAID1 on node1 and assemble it on node2.
    mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
    ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

    sleep 5

    # Add a spare and grow the array to 3 devices.
    mdadm --manage --add /dev/md0 /dev/sdi
    mdadm --wait /dev/md0
    mdadm --grow --raid-devices=3 /dev/md0

    # Fail and remove one leg, then shrink back to 2 devices; the last
    # --grow may race with a resync owned by the remote node.
    mdadm /dev/md0 --fail /dev/sdg
    mdadm /dev/md0 --remove /dev/sdg
    mdadm --grow --raid-devices=2 /dev/md0
    ```
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Zhao Heming <heming.zhao@suse.com>
    Signed-off-by: Song Liu <songliubraving@fb.com>