    btrfs: fix btrfs_prev_leaf() to not return the same key twice · 6f932d4e
    Filipe Manana authored
    A call to btrfs_prev_leaf() may end up returning a path that points to the
    same item (key) again. This happens if, while in btrfs_prev_leaf() and
    after we release the path, a concurrent insertion moves items from a
    sibling leaf into the front of the leaf we were previously at, and an
    item with the computed previous key does not exist.
    
    For example, suppose we have the two following leaves:
    
      Leaf A
    
      -------------------------------------------------------------
      | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
      -------------------------------------------------------------
                  slot 20             slot 21             slot 22
    
      Leaf B
    
      -------------------------------------------------------------
      | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
      -------------------------------------------------------------
          slot 0             slot 1             slot 2
    
    If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
    a path pointing to leaf B and slot 0 and the following happens:
    
    1) At btrfs_prev_leaf() we compute the previous key to search as:
       (300 96 19), which is a key that does not exist in the tree;
    
    2) Then we call btrfs_release_path() at btrfs_prev_leaf();
    
    3) Some other task inserts a key into leaf A that sorts before the key at
       slot 20, for example a key with an objectid of 299. In order to make
       room for the new key, the key at slot 22 is moved to the front of
       leaf B. This happens at push_leaf_right(), called from split_leaf().
    
       After this leaf B now looks like:
    
      --------------------------------------------------------------------------------
      | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
      --------------------------------------------------------------------------------
           slot 0              slot 1             slot 2             slot 3
    
    4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
       previous key: (300 96 19). Since the key does not exist,
       btrfs_search_slot() returns 1 with a path pointing to leaf B and
       slot 1, the item with key (300 96 20);
    
    5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
       leaf B, the same key as before it was called, since the key at slot 0
       of leaf B (300 96 16) is less than the computed previous key, which is
       (300 96 19);
    
    6) As a consequence btrfs_previous_item() returns a path that points again
       to the item with key (300 96 20).
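
    The steps above can be reproduced with a small userspace sketch. It is
    not kernel code: struct demo_key and every demo_*() helper are
    hypothetical names made up for illustration, and a leaf is modelled as
    a plain sorted array. The guard in demo_prev_slot() is only one
    possible way to avoid handing back the same key and is an assumption,
    not a copy of the change made by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a btrfs key: (objectid type offset). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Three-way compare in key order: objectid, then type, then offset. */
static int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Step 1: turn *key into the key sorting right before it; 1 if none exists. */
static int demo_prev_key(struct demo_key *key)
{
	if (key->offset > 0) {
		key->offset--;
	} else if (key->type > 0) {
		key->type--;
		key->offset = UINT64_MAX;
	} else if (key->objectid > 0) {
		key->objectid--;
		key->type = UINT8_MAX;
		key->offset = UINT64_MAX;
	} else {
		return 1;
	}
	return 0;
}

/*
 * Step 4: lower-bound search within one leaf. Returns the first slot whose
 * key is >= *key and sets *found to 1 only on an exact match.
 */
static size_t demo_search_leaf(const struct demo_key *leaf, size_t nritems,
			       const struct demo_key *key, int *found)
{
	size_t low = 0, high = nritems;

	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (demo_comp_keys(&leaf[mid], key) < 0)
			low = mid + 1;
		else
			high = mid;
	}
	*found = low < nritems && demo_comp_keys(&leaf[low], key) == 0;
	return low;
}

/*
 * Steps 1, 4 and 5 against the leaf as it looks after the concurrent
 * insertion. With guard == 0 the slot found by the search is returned as is.
 * With guard != 0 (an assumed mitigation) we step back one slot when the
 * search landed on the original key again.
 */
static size_t demo_prev_slot(const struct demo_key *leaf, size_t nritems,
			     const struct demo_key *orig, int guard)
{
	struct demo_key prev = *orig;
	size_t slot;
	int found;

	assert(nritems > 0);
	if (demo_prev_key(&prev))
		return 0; /* orig is the lowest possible key */

	slot = demo_search_leaf(leaf, nritems, &prev, &found);
	if (guard && !found && slot > 0 && slot < nritems &&
	    demo_comp_keys(&leaf[slot], orig) == 0)
		slot--;
	return slot;
}
```

    Fed with leaf B as it looks after step 3 and the original key
    (300 96 20), demo_prev_slot() without the guard yields slot 1, which
    is key (300 96 20) again. With the guard it yields slot 0, which is
    key (300 96 16).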
    
    For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
    be a functional problem, even though it makes no sense to return a new
    path pointing again to the same item/key. However, for a caller such as
    tree-log.c:log_dir_items() this has a bad consequence: it can result in
    not logging some dir index deletions when the directory is being logged
    without holding the inode's VFS lock (logging triggered while logging a
    child inode, for example). In the example scenario above, this would
    happen if the dir index keys 17, 18 and 19 were deleted in the current
    transaction.
    
    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
ctree.c 132 KB