• Qu Wenruo's avatar
    btrfs: scrub: Fix RAID56 recovery race condition · 28d70e23
    Qu Wenruo authored
    When scrubbing a RAID5 which has recoverable data corruption (only one
    data stripe is corrupted), sometimes scrub will report more csum errors
    than expected. Sometimes even unrecoverable error will be reported.
    
    The problem can be easily reproduced by the following steps:
    1) Create a btrfs with RAID5 data profile with 3 devs
    2) Mount it with nospace_cache or space_cache=v2
       To avoid extra data space usage.
    3) Create a 128K file and sync the fs, unmount it
       Now the 128K file lies at the beginning of the data chunk
    4) Locate the physical bytenr of data chunk on dev3
       Dev3 is the 1st data stripe.
    5) Corrupt the first 64K of the data chunk stripe on dev3
    6) Mount the fs and scrub it
    
    The correct csum error number should be 16 (assuming using x86_64).
    Larger csum error number can be reported in a 1/3 chance.
    And unrecoverable error can also be reported in a 1/10 chance.
    
    The root cause of the problem is RAID5/6 recover code has race
    condition, due to the fact that full scrub is initiated per device.
    
    While for other mirror based profiles, each mirror is independent with
    each other, so race won't cause any big problem.
    
    For example:
            Corrupted       |       Correct          |      Correct        |
    |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
    ------------------------------------------------------------------------
    Read out D1             |Read out D2             |Read full stripe     |
    Check csum              |Check csum              |Check parity         |
    Csum mismatch           |Csum match, continue    |Parity mismatch      |
    handle_errored_block    |                        |handle_errored_block |
     Read out full stripe   |                        | Read out full stripe|
     D1 csum error(err++)   |                        | D1 csum error(err++)|
     Recover D1             |                        | Recover D1          |
    
    So D1's csum error is accounted twice, just because
    handle_errored_block() doesn't have enough protection, and race can happen.
    
    On even worse case, for example D1's recovery code is re-writing
    D1/D2/P, and P's recovery code is just reading out full stripe, then we
    can cause unrecoverable error.
    
    This patch will use previously introduced lock_full_stripe() and
    unlock_full_stripe() to protect the whole scrub_handle_errored_block()
    function for RAID56 recovery.
    So no extra csum error nor unrecoverable error.
    Reported-by: default avatarGoffredo Baroncelli <kreijack@libero.it>
    Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
    Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    28d70e23
scrub.c 122 KB