• Josef Bacik's avatar
    btrfs: wait on uncached block groups on every allocation loop · cd361199
    Josef Bacik authored
    My initial fix for the generic/475 hangs was related to metadata, but
    our CI testing uncovered another case where we hang for similar reasons.
    We again have a task with a plug that is holding an outstanding request
    that is keeping the dm device from finishing it's suspend, and that task
    is stuck in the allocator.
    
    This time it is stuck trying to allocate data, but we do not have a
    block group that matches the size class.  The larger loop in the
    allocator looks like this (simplified of course)
    
      find_free_extent
        for_each_block_group {
          ffe_ctl->cached == btrfs_block_group_cache_done(bg)
          if (!ffe_ctl->cached)
    	ffe_ctl->have_caching_bg = true;
          do_allocation()
    	btrfs_wait_block_group_cache_progress();
        }
    
        if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
          go search again;
    
    In my earlier fix we were trying to allocate from the block group, but
    we weren't waiting for the progress because we were only waiting for the
    free space to be >= the amount of free space we wanted.  My fix made it
    so we waited for forward progress to be made as well, so we would be
    sure to wait.
    
    This time however we did not have a block group that matched our size
    class, so what was happening was this
    
      find_free_extent
        for_each_block_group {
          ffe_ctl->cached == btrfs_block_group_cache_done(bg)
          if (!ffe_ctl->cached)
    	ffe_ctl->have_caching_bg = true;
          if (size_class_doesn't_match())
    	goto loop;
          do_allocation()
    	btrfs_wait_block_group_cache_progress();
      loop:
          release_block_group(block_group);
        }
    
        if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
          go search again;
    
    The size_class_doesn't_match() part was true, so we'd just skip this
    block group and never wait for caching, and then because we found a
    caching block group we'd just go back and do the loop again.  We never
    sleep and thus never flush the plug and we have the same deadlock.
    
    Fix the logic for waiting on the block group caching to instead do it
    unconditionally when we goto loop.  This takes the logic out of the
    allocation step, so now the loop looks more like this
    
      find_free_extent
        for_each_block_group {
          ffe_ctl->cached == btrfs_block_group_cache_done(bg)
          if (!ffe_ctl->cached)
    	ffe_ctl->have_caching_bg = true;
          if (size_class_doesn't_match())
    	goto loop;
          do_allocation()
    	btrfs_wait_block_group_cache_progress();
      loop:
          if (loop > LOOP_CACHING_NOWAIT && !ffe_ctl->retry_uncached &&
    	  !ffe_ctl->cached) {
    	 ffe_ctl->retry_uncached = true;
    	 btrfs_wait_block_group_cache_progress();
          }
    
          release_block_group(block_group);
        }
    
        if (loop == LOOP_CACHING_WAIT && ffe_ctl->have_caching_bg)
          go search again;
    
    This simplifies the logic a lot, and makes sure that if we're hitting
    uncached block groups we're always waiting on them at some point.
    
    I ran this through 100 iterations of generic/475, as this particular
    case was harder to hit than the previous one.
    Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    cd361199
extent-tree.h 4.89 KB