    raid: improve MD/raid10 handling of correctable read errors. · 1e50915f
    Robert Becker authored
    We've noticed severe lasting performance degradation of our raid
    arrays when we have drives that yield large amounts of media errors.
    The raid10 module will queue each failed read for retry, and will
    also attempt to call fix_read_error() to perform the read recovery.
    Read recovery is performed while the array is frozen, so repeated
    recovery attempts can degrade the performance of the array for
    extended periods of time.
    
    With this patch I propose adding a per md device max number of
    corrected read attempts.  Each rdev will maintain a count of
    read correction attempts in the rdev->read_errors field (not
    used currently for raid10). When we enter fix_read_error()
    we'll check to see when the last read error occurred, and
    divide the read error count by 2 for every hour since the
    last read error. If at that point our read error count
    exceeds the read error threshold, we'll fail the raid device.
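
    The decay-and-threshold check described above can be sketched in
    plain userspace C as follows. This is an illustration only, not the
    code in the patch: struct rdev_sketch, decay_read_errors() and
    note_read_error() are hypothetical names, a stored time of 0 is
    assumed to mean "no earlier read error", and counting the new error
    after the decay step is an assumption the description does not spell
    out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-rdev state; not the kernel's struct. */
struct rdev_sketch {
    unsigned int read_errors;     /* decayed count of read correction attempts */
    int64_t last_read_error_sec;  /* time of the previous read error, 0 = none yet */
};

/* Halve `count` once for every full hour elapsed. */
static unsigned int decay_read_errors(unsigned int count, uint64_t hours)
{
    /* Shifting by the type width or more is undefined in C, so clamp to 0. */
    if (hours >= 8 * sizeof(count))
        return 0;
    return count >> hours;
}

/*
 * Record one read error at time `now_sec` (seconds, monotonic).
 * Returns true when the decayed count exceeds `max_read_errors`,
 * i.e. when the caller would fail the device.
 */
static bool note_read_error(struct rdev_sketch *rdev, int64_t now_sec,
                            unsigned int max_read_errors)
{
    if (rdev->last_read_error_sec != 0 &&
        now_sec > rdev->last_read_error_sec) {
        uint64_t hours =
            (uint64_t)(now_sec - rdev->last_read_error_sec) / 3600;
        rdev->read_errors = decay_read_errors(rdev->read_errors, hours);
    }
    rdev->last_read_error_sec = now_sec;

    /* Assumption: the current error is counted after the decay step. */
    rdev->read_errors++;
    return rdev->read_errors > max_read_errors;
}
```

    With a threshold of 2, three errors within the same hour trip the
    check on the third call, while a further error two hours later
    starts again from a decayed count.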
    
    In addition, this patch adds sysfs nodes (get/set) for the
    per md max_read_errors attribute and the rdev->read_errors
    attribute, and adds some printk's to indicate when
    fix_read_error fails to repair an rdev.
    
    For testing I used debugfs->fail_make_request to inject
    IO errors to the rdev while doing IO to the raid array.
    Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
    Signed-off-by: NeilBrown <neilb@suse.de>
md.c 184 KB