    Btrfs: fix quick exhaustion of the system array in the superblock · 00d80e34
    Filipe Manana authored
    Omar reported that after commit 4fbcdf66 ("Btrfs: fix -ENOSPC when
    finishing block group creation"), introduced in 4.2-rc1, the following
    test was failing due to exhaustion of the system array in the superblock:
    
      #!/bin/bash
    
      truncate -s 100T big.img
      mkfs.btrfs big.img
      mount -o loop big.img /mnt/loop
    
      num=5
      sz=10T
      for ((i = 0; i < $num; i++)); do
          echo fallocate $i $sz
          fallocate -l $sz /mnt/loop/testfile$i
      done
      btrfs filesystem sync /mnt/loop
    
      for ((i = 0; i < $num; i++)); do
          echo rm $i
          rm /mnt/loop/testfile$i
          btrfs filesystem sync /mnt/loop
      done
      umount /mnt/loop
    
    This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
    allocation of system block groups. This happened because the test creates
    a large number of data block groups per transaction, and when committing
    the transaction we start the writeout of the block group caches for all
    the new (dirty) block groups, which results in pre-allocating space
    for each block group's free space cache using the same transaction handle.
    That in turn often leads to the creation of more block groups, all of
    which get attached to the new_bgs list of the same transaction handle, to
    the point of getting a list with over 1500 elements. Creating each new
    block group requires reserving space in the chunk block reserve and often
    creating a new system block group too.
    
    That made us quickly exhaust the chunk block reserve/system space info,
    because as of the commit mentioned before, we reserve space in the chunk
    block reserve for each new block group. Before that commit we did not:
    we would allocate at most one new system block group, and therefore only
    ensured that there was enough space in the system space info to allocate
    1 new block group, even if we ended up allocating thousands of new block
    groups using the same transaction handle. That worked most of the time
    because the required space computed at check_system_chunk() is very
    pessimistic (it assumes a chunk tree height of BTRFS_MAX_LEVEL (8) and
    that all nodes/leafs in a path will be COWed and split), and since the
    updates to the chunk tree all happen at
    btrfs_create_pending_block_groups(), it is unlikely that a path needs to
    be COWed more than once (unless writepages() for the btree inode is
    called by mm in between). That compensated for the need to create any
    new nodes/leafs in the chunk tree.
    
    So fix this by ensuring we don't accumulate too large a list of new block
    groups in a transaction handle's new_bgs list: whenever the list becomes
    sufficiently large, insert/update the chunk tree items for all accumulated
    new block groups and release the unused space from the chunk block
    reserve. This is a generic solution even though the problem currently can
    only happen when starting the writeout of the free space caches for all
    dirty block groups (btrfs_start_dirty_block_groups()).
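
    The idea of the fix can be illustrated with a small user-space model.
    This is only a sketch and not the kernel code: the struct, the function
    names, the 64 KiB per-block-group reservation and the 2 MiB threshold are
    all assumptions made for illustration. It shows how flushing the pending
    list once the reserved bytes cross a threshold bounds the list length.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only, not taken from the kernel source. */
#define PER_BG_RESERVE  (64ULL * 1024)        /* assumed reservation per new block group */
#define FLUSH_THRESHOLD (2ULL * 1024 * 1024)  /* assumed flush threshold, in bytes */

/* Hypothetical stand-in for a transaction handle. */
struct model_trans {
    size_t pending_bgs;                      /* models the length of new_bgs */
    unsigned long long chunk_bytes_reserved; /* models space held in the chunk block reserve */
};

/* Summary of one simulated run. */
struct model_result {
    size_t max_pending;   /* longest the pending list ever got */
    size_t flushes;       /* how many times the list was flushed */
    size_t final_pending; /* entries left at the end of the run */
};

/*
 * Stand-in for "insert/update the chunk tree for all accumulated block
 * groups and release the unused reservation": here it only resets counters.
 */
static void model_flush(struct model_trans *t)
{
    assert(t != NULL);
    t->pending_bgs = 0;
    t->chunk_bytes_reserved = 0;
}

/*
 * Simulate creating n block groups with a single transaction handle.
 * A threshold of 0 disables flushing, which models the old behaviour of
 * letting the list grow without bound.
 */
static struct model_result model_create_block_groups(size_t n,
                                                     unsigned long long threshold)
{
    struct model_trans t = { 0, 0 };
    struct model_result r = { 0, 0, 0 };

    for (size_t i = 0; i < n; i++) {
        /* Flush before reserving more once enough space is already held. */
        if (threshold != 0 && t.chunk_bytes_reserved >= threshold) {
            model_flush(&t);
            r.flushes++;
        }
        t.pending_bgs++;
        t.chunk_bytes_reserved += PER_BG_RESERVE;
        if (t.pending_bgs > r.max_pending)
            r.max_pending = t.pending_bgs;
    }
    r.final_pending = t.pending_bgs;
    return r;
}
```

    With these assumed numbers, model_create_block_groups(1500,
    FLUSH_THRESHOLD) never holds more than 32 pending block groups, while a
    threshold of 0 lets the list grow to all 1500 entries.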

    Reported-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Tested-by: Omar Sandoval <osandov@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
extent-tree.c 273 KB