• Dave Chinner's avatar
    mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · 64081362
    Dave Chinner authored
    We've recently seen a workload on XFS filesystems with a repeatable
    deadlock between background writeback and a multi-process application
    doing concurrent writes and fsyncs to a small range of a file.
    
    range_cyclic
    writeback		Process 1		Process 2
    
    xfs_vm_writepages
      write_cache_pages
        writeback_index = 2
        cycled = 0
        ....
        find page 2 dirty
        lock Page 2
        ->writepage
          page 2 writeback
          page 2 clean
          page 2 added to bio
        no more pages
    			write()
    			locks page 1
    			dirties page 1
    			locks page 2
    			dirties page 1
    			fsync()
    			....
    			xfs_vm_writepages
    			write_cache_pages
    			  start index 0
    			  find page 1 towrite
    			  lock Page 1
    			  ->writepage
    			    page 1 writeback
    			    page 1 clean
    			    page 1 added to bio
    			  find page 2 towrite
    			  lock Page 2
    			  page 2 is writeback
    			  <blocks>
    						write()
    						locks page 1
    						dirties page 1
    						fsync()
    						....
    						xfs_vm_writepages
    						write_cache_pages
    						  start index 0
    
        !done && !cycled
          sets index to 0, restarts lookup
        find page 1 dirty
    						  find page 1 towrite
    						  lock Page 1
    						  page 1 is writeback
    						  <blocks>
    
        lock Page 1
        <blocks>
    
    DEADLOCK because:
    
    	- process 1 needs page 2 writeback to complete to make
    	  enough progress to issue IO pending for page 1
    	- writeback needs page 1 writeback to complete so process 2
    	  can progress and unlock the page it is blocked on, then it
    	  can issue the IO pending for page 2
    	- process 2 can't make progress until process 1 issues IO
    	  for page 1
    
    The underlying cause of the problem here is that range_cyclic writeback is
    processing pages in descending index order as we hold higher index pages
    in a structure controlled from above write_cache_pages().  The
    write_cache_pages() caller needs to be able to submit these pages for IO
    before write_cache_pages restarts writeback at mapping index 0 to avoid
    wcp inverting the page lock/writeback wait order.
    
    generic_writepages() is not susceptible to this bug as it has no private
    context held across write_cache_pages() - filesystems using this
    infrastructure always submit pages in ->writepage immediately and so there
    is no problem with range_cyclic going back to mapping index 0.
    
    However:
    	mpage_writepages() has a private bio context,
    	exofs_writepages() has page_collect
    	fuse_writepages() has fuse_fill_wb_data
    	nfs_writepages() has nfs_pageio_descriptor
    	xfs_vm_writepages() has xfs_writepage_ctx
    
    All of these ->writepages implementations can hold pages under writeback
    in their private structures until write_cache_pages() returns, and hence
    they are all susceptible to this deadlock.
    
    Also worth noting is that ext4 has it's own bastardised version of
    write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
    at the code long enough to understand that it has a similar retry loop for
    range_cyclic writeback reaching the end of the file and then promptly ran
    away before my eyes bled too much.  I'll leave it for the ext4 developers
    to determine if their code is actually has this deadlock and how to fix it
    if it has.
    
    There's a few ways I can see avoid this deadlock.  There's probably more,
    but these are the first I've though of:
    
    1. get rid of range_cyclic altogether
    
    2. range_cyclic always stops at EOF, and we start again from
    writeback index 0 on the next call into write_cache_pages()
    
    2a. wcp also returns EAGAIN to ->writepages implementations to
    indicate range cyclic has hit EOF. writepages implementations can
    then flush the current context and call wpc again to continue. i.e.
    lift the retry into the ->writepages implementation
    
    3. range_cyclic uses trylock_page() rather than lock_page(), and it
    skips pages it can't lock without blocking. It will already do this
    for pages under writeback, so this seems like a no-brainer
    
    3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
    blocking as per pages under writeback.
    
    I don't think #1 is an option - range_cyclic prevents frequently
    dirtied lower file offset from starving background writeback of
    rarely touched higher file offsets.
    
    #2 is simple, and I don't think it will have any impact on
    performance as going back to the start of the file implies an
    immediate seek. We'll have exactly the same number of seeks if we
    switch writeback to another inode, and then come back to this one
    later and restart from index 0.
    
    #2a is pretty much "status quo without the deadlock". Moving the
    retry loop up into the wcp caller means we can issue IO on the
    pending pages before calling wcp again, and so avoid locking or
    waiting on pages in the wrong order. I'm not convinced we need to do
    this given that we get the same thing from #2 on the next writeback
    call from the writeback infrastructure.
    
    #3 is really just a band-aid - it doesn't fix the access/wait
    inversion problem, just prevents it from becoming a deadlock
    situation. I'd prefer we fix the inversion, not sweep it under the
    carpet like this.
    
    #3a is really an optimisation that just so happens to include the
    band-aid fix of #3.
    
    So it seems that the simplest way to fix this issue is to implement
    solution #2
    
    Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.comSigned-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarJan Kara <jack@suse.de>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    64081362
page-writeback.c 84.6 KB