    xfs: check sb_meta_uuid for dabuf buffer recovery · 09654ed8
    Dave Chinner authored
    Got a report that a repeated crash test of a container host would
    eventually fail with a log recovery error preventing the system from
    mounting the root filesystem. It manifested as a directory leaf node
    corruption on writeback like so:
    
     XFS (loop0): Mounting V5 Filesystem
     XFS (loop0): Starting recovery (logdev: internal)
     XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
     XFS (loop0): Unmount and run xfs_repair
     XFS (loop0): First 128 bytes of corrupted metadata buffer:
     00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
     00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
     00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
     00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
     00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
     00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
     00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
     00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
     XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
     XFS (loop0): Please unmount the filesystem and rectify the problem(s)
     XFS (loop0): log mount/recovery failed: error -117
     XFS (loop0): log mount failed
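
    The header fields in the dump above can be read back out with a small
    sketch like the following. The offsets assume the v5 dir/attr block
    header layout (forw, back, magic, pad, crc, blkno, lsn, uuid, owner,
    all big-endian); struct da3_hdr_fields and decode_da3_hdr() are
    hypothetical names used only for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the fields of interest; not a kernel type. */
struct da3_hdr_fields {
    uint16_t magic;
    uint64_t blkno;
    uint32_t lsn_cycle;
    uint32_t lsn_block;
};

/* Read an n-byte big-endian integer starting at p. */
static uint64_t be_read(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Assumed offsets: magic at 8 (2 bytes), blkno at 16 (8 bytes),
 * lsn at 24 (8 bytes, cycle in the high 32 bits, block in the low 32).
 */
static struct da3_hdr_fields decode_da3_hdr(const uint8_t *buf, size_t len)
{
    struct da3_hdr_fields f;
    uint64_t lsn;

    assert(len >= 32);
    lsn = be_read(buf + 24, 8);
    f.magic = (uint16_t)be_read(buf + 8, 2);
    f.blkno = be_read(buf + 16, 8);
    f.lsn_cycle = (uint32_t)(lsn >> 32);
    f.lsn_block = (uint32_t)lsn;
    return f;
}

/* First 32 bytes of the corrupted metadata buffer dump. */
static const uint8_t dump_head[32] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x3d, 0xf1, 0x00, 0x00, 0xe1, 0x9e, 0xd5, 0x8b,
    0x00, 0x00, 0x00, 0x00, 0x12, 0xfa, 0xa1, 0x58,
    0x00, 0x00, 0x00, 0x29, 0x00, 0x00, 0x1b, 0xcc,
};
```

    Under those assumed offsets, decode_da3_hdr(dump_head, 32) yields
    magic 0x3df1, blkno 0x12faa158 (the block number in the error
    message) and an LSN field of 0x29/0x1bcc.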
    
    Tracing indicated that we were recovering changes from a transaction
    at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
    That is, log recovery was overwriting a buffer that had newer changes
    on disk than were in the transaction. Tracing indicated that we were
    hitting the "recovery immediately" case in
    xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
    buffer.
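
    As a rough model of the replay decision described above (a simplified
    sketch, not the kernel code): an LSN is treated as a 64-bit value with
    the cycle in the high 32 bits and the block in the low 32 bits, and a
    buffer is replayed only if it carries no usable LSN or the transaction
    is newer than the buffer. lsn_t, make_lsn() and should_replay() are
    hypothetical names, and -1 stands in for "no usable LSN".

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;          /* hypothetical stand-in for an XFS LSN */

#define LSN_NONE ((lsn_t)-1)    /* assumed "recover immediately" marker */

/* Pack a cycle/block pair such as 0x29/0x1c16 into one comparable value. */
static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
    return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/*
 * Replay the logged change only when the buffer has no usable LSN,
 * or when the transaction is strictly newer than the buffer on disk.
 */
static bool should_replay(lsn_t buf_lsn, lsn_t txn_lsn)
{
    if (buf_lsn == LSN_NONE)
        return true;            /* LSN ignored: recover immediately */
    return txn_lsn > buf_lsn;
}
```

    With the values from the trace, should_replay(make_lsn(0x29, 0x1d57),
    make_lsn(0x29, 0x1c16)) is false, so the buffer should have been
    skipped. Once the buffer LSN is discarded (LSN_NONE), the same call
    returns true and the older change is replayed over newer metadata.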
    
    The code was extracting the LSN correctly, then ignoring it because
    the UUID in the buffer did not match the superblock UUID. The
    problem arises because the UUID check uses the wrong UUID - it
    should be checking the sb_meta_uuid, not sb_uuid. This filesystem
    has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
    correct matching sb_meta_uuid in it; the code simply checked it
    against the wrong superblock UUID.
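
    The difference between the two checks can be sketched as below. This
    is an illustrative model only: struct sb_model and both functions are
    hypothetical, and -1 is assumed to mean "no usable LSN, recover
    immediately". Only the field names sb_uuid and sb_meta_uuid come from
    the description above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down superblock holding just the two UUIDs. */
struct sb_model {
    uint8_t sb_uuid[16];       /* user-visible filesystem UUID */
    uint8_t sb_meta_uuid[16];  /* UUID stamped into metadata blocks */
};

static bool uuid_eq(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, 16) == 0;
}

/* Buggy variant: compares the buffer UUID against sb_uuid. */
static int64_t recovery_lsn_buggy(const struct sb_model *sb,
                                  const uint8_t *buf_uuid, int64_t buf_lsn)
{
    return uuid_eq(buf_uuid, sb->sb_uuid) ? buf_lsn : -1;
}

/* Fixed variant: compares the buffer UUID against sb_meta_uuid. */
static int64_t recovery_lsn_fixed(const struct sb_model *sb,
                                  const uint8_t *buf_uuid, int64_t buf_lsn)
{
    return uuid_eq(buf_uuid, sb->sb_meta_uuid) ? buf_lsn : -1;
}
```

    For a filesystem where the two UUIDs differ and the buffer carries the
    sb_meta_uuid value, the buggy variant returns -1 and the buffer LSN is
    lost, while the fixed variant returns the buffer LSN unchanged.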
    
    There is no corruption in the filesystem, and failing to recover the
    buffer due to a write verifier failure means the recovery bug did
    not propagate the corruption to disk. Hence there is no corruption
    before or after this bug has manifested; the impact is limited
    simply to an unmountable filesystem....
    
    This was missed back in 2015 during an audit of incorrect sb_uuid
    usage that resulted in commit fcfbe2c4 ("xfs: log recovery needs
    to validate against sb_meta_uuid") that fixed the magic32 buffers to
    validate against sb_meta_uuid instead of sb_uuid. It missed the
    magicda buffers....
    
    Fixes: ce748eaa ("xfs: create new metadata UUID field and incompat flag")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_buf_item_recover.c 28.3 KB