• Boris Burkov's avatar
    btrfs: fix fatal extent_buffer readahead vs releasepage race · 8ee03580
    Boris Burkov authored
    commit 6bf9cd2e upstream.
    
    Under somewhat convoluted conditions, it is possible to attempt to
    release an extent_buffer that is under io, which triggers a BUG_ON in
    btrfs_release_extent_buffer_pages.
    
    This relies on a few different factors. First, extent_buffer reads done
    as readahead for searching use WAIT_NONE, so they free the local extent
    buffer reference while the io is outstanding. However, they should still
    be protected by TREE_REF. However, if the system is doing signficant
    reclaim, and simultaneously heavily accessing the extent_buffers, it is
    possible for releasepage to race with two concurrent readahead attempts
    in a way that leaves TREE_REF unset when the readahead extent buffer is
    released.
    
    Essentially, if two tasks race to allocate a new extent_buffer, but the
    winner who attempts the first io is rebuffed by a page being locked
    (likely by the reclaim itself) then the loser will still go ahead with
    issuing the readahead. The loser's call to find_extent_buffer must also
    race with the reclaim task reading the extent_buffer's refcount as 1 in
    a way that allows the reclaim to re-clear the TREE_REF checked by
    find_extent_buffer.
    
    The following represents an example execution demonstrating the race:
    
                CPU0                                                         CPU1                                           CPU2
    reada_for_search                                            reada_for_search
      readahead_tree_block                                        readahead_tree_block
        find_create_tree_block                                      find_create_tree_block
          alloc_extent_buffer                                         alloc_extent_buffer
                                                                      find_extent_buffer // not found
                                                                      allocates eb
                                                                      lock pages
                                                                      associate pages to eb
                                                                      insert eb into radix tree
                                                                      set TREE_REF, refs == 2
                                                                      unlock pages
                                                                  read_extent_buffer_pages // WAIT_NONE
                                                                    not uptodate (brand new eb)
                                                                                                                lock_page
                                                                    if !trylock_page
                                                                      goto unlock_exit // not an error
                                                                  free_extent_buffer
                                                                    release_extent_buffer
                                                                      atomic_dec_and_test refs to 1
            find_extent_buffer // found
                                                                                                                try_release_extent_buffer
                                                                                                                  take refs_lock
                                                                                                                  reads refs == 1; no io
              atomic_inc_not_zero refs to 2
              mark_buffer_accessed
                check_buffer_tree_ref
                  // not STALE, won't take refs_lock
                  refs == 2; TREE_REF set // no action
        read_extent_buffer_pages // WAIT_NONE
                                                                                                                  clear TREE_REF
                                                                                                                  release_extent_buffer
                                                                                                                    atomic_dec_and_test refs to 1
                                                                                                                    unlock_page
          still not uptodate (CPU1 read failed on trylock_page)
          locks pages
          set io_pages > 0
          submit io
          return
        free_extent_buffer
          release_extent_buffer
            dec refs to 0
            delete from radix tree
            btrfs_release_extent_buffer_pages
              BUG_ON(io_pages > 0)!!!
    
    We observe this at a very low rate in production and were also able to
    reproduce it in a test environment by introducing some spurious delays
    and by introducing probabilistic trylock_page failures.
    
    To fix it, we apply check_tree_ref at a point where it could not
    possibly be unset by a competing task: after io_pages has been
    incremented. All the codepaths that clear TREE_REF check for io, so they
    would not be able to clear it after this point until the io is done.
    
    Stack trace, for reference:
    [1417839.424739] ------------[ cut here ]------------
    [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
    [1417839.447024] invalid opcode: 0000 [#1] SMP
    [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
    [1417839.517008] Code: ed e9 ...
    [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
    [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
    [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
    [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
    [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
    [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
    [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
    [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
    [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [1417839.731320] Call Trace:
    [1417839.737103]  release_extent_buffer+0x39/0x90
    [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
    [1417839.758645]  btrfs_search_slot+0x260/0x9b0
    [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
    [1417839.778427]  btrfs_get_extent+0x15f/0x830
    [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
    [1417839.797474]  ? __do_readpage+0x299/0x7a0
    [1417839.806515]  __do_readpage+0x33b/0x7a0
    [1417839.815171]  ? btrfs_releasepage+0x70/0x70
    [1417839.824597]  extent_readpages+0x28f/0x400
    [1417839.833836]  read_pages+0x6a/0x1c0
    [1417839.841729]  ? startup_64+0x2/0x30
    [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
    [1417839.860590]  filemap_fault+0x6c7/0x990
    [1417839.869252]  ? xas_load+0x8/0x80
    [1417839.876756]  ? xas_find+0x150/0x190
    [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
    [1417839.894652]  __do_fault+0x32/0x110
    [1417839.902540]  __handle_mm_fault+0xacd/0x1000
    [1417839.912156]  handle_mm_fault+0xaa/0x1c0
    [1417839.921004]  __do_page_fault+0x242/0x4b0
    [1417839.930044]  ? page_fault+0x8/0x30
    [1417839.937933]  page_fault+0x1e/0x30
    [1417839.945631] RIP: 0033:0x33c4bae
    [1417839.952927] Code: Bad RIP value.
    [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
    [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
    [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
    [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
    [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
    [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
    
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarBoris Burkov <boris@bur.io>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    8ee03580
extent_io.c 150 KB