• Boris Burkov's avatar
    btrfs: make cow_file_range_inline() honor locked_page on error · 47857437
    Boris Burkov authored
    The btrfs buffered write path runs through __extent_writepage() which
    has some tricky return value handling for writepage_delalloc().
    Specifically, when that returns 1, we exit, but for other return values
    we continue and end up calling btrfs_folio_end_all_writers(). If the
    folio has been unlocked (note that we check the PageLocked bit at the
    start of __extent_writepage()), this results in an assert panic like
    this one from syzbot:
    
      BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
      BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
      BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
      assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/subpage.c:871!
      Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
      6.10.0-syzkaller-05505-gb1bc554e #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 06/27/2024
      RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
      Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
      0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
      6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
      RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
      RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
      RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
      RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
      R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
      R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
      FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
      extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
      btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
      do_writepages+0x359/0x870 mm/page-writeback.c:2656
      filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
      __filemap_fdatawrite_range mm/filemap.c:430 [inline]
      __filemap_fdatawrite mm/filemap.c:436 [inline]
      filemap_flush+0xdf/0x130 mm/filemap.c:463
      btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
      __fput+0x24a/0x8a0 fs/file_table.c:422
      task_work_run+0x24f/0x310 kernel/task_work.c:222
      exit_task_work include/linux/task_work.h:40 [inline]
      do_exit+0xa2f/0x27f0 kernel/exit.c:877
      do_group_exit+0x207/0x2c0 kernel/exit.c:1026
      __do_sys_exit_group kernel/exit.c:1037 [inline]
      __se_sys_exit_group kernel/exit.c:1035 [inline]
      __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
      x64_sys_call+0x2634/0x2640
      arch/x86/include/generated/asm/syscalls_64.h:232
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
      entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f5f075b70c9
      Code: Unable to access opcode bytes at
      0x7f5f075b709f.
    
    I was hitting the same issue by doing hundreds of accelerated runs of
    generic/475, which also hits IO errors by design.
    
    I instrumented that reproducer with bpftrace and found that the
    undesirable folio_unlock was coming from the following callstack:
    
      folio_unlock+5
      __process_pages_contig+475
      cow_file_range_inline.constprop.0+230
      cow_file_range+803
      btrfs_run_delalloc_range+566
      writepage_delalloc+332
      __extent_writepage # inlined in my stacktrace, but I added it here
      extent_write_cache_pages+622
    
    Looking at the bisected-to patch in the syzbot report, Josef realized
    that the logic of the cow_file_range_inline error path subtly changing.
    In the past, on error, it jumped to out_unlock in cow_file_range(),
    which honors the locked_page, so when we ultimately call
    folio_end_all_writers(), the folio of interest is still locked. After
    the change, we always unlocked ignoring the locked_page, on both success
    and error. On the success path, this all results in returning 1 to
    __extent_writepage(), which skips the folio_end_all_writers() call,
    which makes it OK to have unlocked.
    
    Fix the bug by wiring the locked_page into cow_file_range_inline() and
    only setting locked_page to NULL on success.
    
    Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
    Fixes: 0586d0a8 ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
    CC: stable@vger.kernel.org # 6.10+
    Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarBoris Burkov <boris@bur.io>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    47857437
inode.c 288 KB