• Linus Torvalds's avatar
    mm: make wait_on_page_writeback() wait for multiple pending writebacks · c2407cf7
    Linus Torvalds authored
    Ever since commit 2a9127fc ("mm: rewrite wait_on_page_bit_common()
    logic") we've had some very occasional reports of BUG_ON(PageWriteback)
    in write_cache_pages(), which we thought we already fixed in commit
    073861ed ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
    
    But syzbot just reported another one, even with that commit in place.
    
    And it turns out that there's a simpler way to trigger the BUG_ON() than
    the one Hugh found with page re-use.  It all boils down to the fact that
    the page writeback is ostensibly serialized by the page lock, but that
    isn't actually really true.
    
    Yes, the people _setting_ writeback all do so under the page lock, but
    the actual clearing of the bit - and waking up any waiters - happens
    without any page lock.
    
    This gives us this fairly simple race condition:
    
      CPU1 = end previous writeback
      CPU2 = start new writeback under page lock
      CPU3 = write_cache_pages()
    
      CPU1          CPU2            CPU3
      ----          ----            ----
    
      end_page_writeback()
        test_clear_page_writeback(page)
        ... delayed...
    
                    lock_page();
                    set_page_writeback()
                    unlock_page()
    
                                    lock_page()
                                    wait_on_page_writeback();
    
        wake_up_page(page, PG_writeback);
        .. wakes up CPU3 ..
    
                                    BUG_ON(PageWriteback(page));
    
    where the BUG_ON() happens because we woke up the PG_writeback bit
    becasue of the _previous_ writeback, but a new one had already been
    started because the clearing of the bit wasn't actually atomic wrt the
    actual wakeup or serialized by the page lock.
    
    The reason this didn't use to happen was that the old logic in waiting
    on a page bit would just loop if it ever saw the bit set again.
    
    The nice proper fix would probably be to get rid of the whole "wait for
    writeback to clear, and then set it" logic in the writeback path, and
    replace it with an atomic "wait-to-set" (ie the same as we have for page
    locking: we set the page lock bit with a single "lock_page()", not with
    "wait for lock bit to clear and then set it").
    
    However, out current model for writeback is that the waiting for the
    writeback bit is done by the generic VFS code (ie write_cache_pages()),
    but the actual setting of the writeback bit is done much later by the
    filesystem ".writepages()" function.
    
    IOW, to make the writeback bit have that same kind of "wait-to-set"
    behavior as we have for page locking, we'd have to change our roughly
    ~50 different writeback functions.  Painful.
    
    Instead, just make "wait_on_page_writeback()" loop on the very unlikely
    situation that the PG_writeback bit is still set, basically re-instating
    the old behavior.  This is very non-optimal in case of contention, but
    since we only ever set the bit under the page lock, that situation is
    controlled.
    
    Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
    Fixes: 2a9127fc ("mm: rewrite wait_on_page_bit_common() logic")
    Acked-by: default avatarHugh Dickins <hughd@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: stable@kernel.org
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    c2407cf7
page-writeback.c 84.6 KB