• Tejun Heo's avatar
    btrfs: Avoid getting stuck during cyclic writebacks · f7bddf1e
    Tejun Heo authored
    During a cyclic writeback, extent_write_cache_pages() uses done_index
    to update the writeback_index after the current run is over.  However,
    instead of current index + 1, it gets to to the current index itself.
    
    Unfortunately, this, combined with returning on EOF instead of looping
    back, can lead to the following pathlogical behavior.
    
    1. There is a single file which has accumulated enough dirty pages to
       trigger balance_dirty_pages() and the writer appending to the file
       with a series of short writes.
    
    2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
    
    3. Writeback kicks in and the cursor is on the last page of the dirty
       file.  Writeback is started or skipped if already in progress.  As
       it's EOF, extent_write_cache_pages() returns and the cursor is set
       to done_index which is pointing to the last page.
    
    4. Writeback is done.  Nothing happens till balance_dirty_pages
       finishes, at which point we go back to #1.
    
    This can almost completely stall out writing back of the file and keep
    the system over dirty threshold for a long time which can mess up the
    whole system.  We encountered this issue in production with a package
    handling application which can reliably reproduce the issue when
    running under tight memory limits.
    
    Reading the comment in the error handling section, this seems to be to
    avoid accidentally skipping a page in case the write attempt on the
    page doesn't succeed.  However, this concern seems bogus.
    
    On each page, the code either:
    
    * Skips and moves onto the next page.
    
    * Fails issue and sets done_index to index + 1.
    
    * Successfully issues and continue to the next page if budget allows
      and not EOF.
    
    IOW, as long as it's not EOF and there's budget, the code never
    retries writing back the same page.  Only when a page happens to be
    the last page of a particular run, we end up retrying the page, which
    can't possibly guarantee anything data integrity related.  Besides,
    cyclic writes are only used for non-syncing writebacks meaning that
    there's no data integrity implication to begin with.
    
    Fix it by always setting done_index past the current page being
    processed.
    
    Note that this problem exists in other writepages too.
    
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    f7bddf1e
extent_io.c 154 KB