    commit 583b7231
    btrfs: Do not use data_alloc_cluster in ssd mode
    Author: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
    This patch provides a band-aid to improve the 'out of the box'
    behaviour of btrfs for disks that are detected as being an ssd. In a
    general purpose mixed workload scenario, the current ssd mode causes
    overallocation of available raw disk space for data, while leaving
    behind increasing amounts of unused fragmented free space. This
    situation leads to early ENOSPC problems, which harm user experience
    and adoption of btrfs as a general purpose filesystem.
    
    This patch modifies the data extent allocation behaviour of the ssd
    mode to make it behave identically to nossd mode. The metadata
    behaviour and the additional ssd_spread option remain untouched for now.
    
    Recommendations for future development are to reconsider the current
    oversimplified nossd / ssd distinction and the broken detection
    mechanism based on the rotational attribute in sysfs, and to provide
    experienced users with a more flexible way to choose allocator
    behaviour for data and metadata, optimized for certain use cases,
    while keeping sane 'out of the box' default settings. The internals of
    the current btrfs code have more potential than what currently gets
    exposed to the user to choose from.
    
        The SSD story...
    
        In the first year of btrfs development, around early 2008, btrfs
    gained a mount option which enables specific functionality for
    filesystems on solid state devices. The first occurrence of this
    functionality is in commit e18e4809, labeled "Add mount -o ssd, which
    includes optimizations for seek free storage".
    
    The effect on allocating free space for doing (data) writes is to
    'cluster' writes together, writing them out in contiguous space, as
    opposed to a 'tetris' way of putting all separate writes into any free
    space fragment that fits (which is what the -o nossd behaviour does).
    
    A somewhat simplified explanation of what happens: when, for example,
    the 'cluster' size is set to 2MiB and we do some writes, the data
    allocator will search for a free space area that is 2MiB big and put
    the writes in there. The ssd mode itself might allow a 2MiB cluster to
    be composed of multiple free space extents with some existing data in
    between, while the additional ssd_spread mount option kills off this
    option and requires fully free space.
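
    A toy C sketch of the difference (not the btrfs allocator; all names
    here are made up for illustration): 'tetris' takes any free space
    fragment the write fits into, while 'cluster' insists on finding a
    region of at least the cluster size. For brevity the sketch only
    models the strict variant that needs one fully free region, which is
    closer to ssd_spread; plain ssd mode would also accept a window that
    contains gaps.

        #include <stdio.h>
        #include <stdint.h>

        #define CLUSTER_SIZE (2ULL * 1024 * 1024)  /* 2MiB, as in the example */

        struct free_extent {
                uint64_t start;
                uint64_t len;
        };

        /* 'tetris' (nossd): first fragment that can hold the write wins. */
        static const struct free_extent *
        find_tetris(const struct free_extent *fe, int n, uint64_t write_len)
        {
                for (int i = 0; i < n; i++)
                        if (fe[i].len >= write_len)
                                return &fe[i];
                return NULL;
        }

        /* 'cluster': only a fully free region >= CLUSTER_SIZE is accepted. */
        static const struct free_extent *
        find_cluster(const struct free_extent *fe, int n)
        {
                for (int i = 0; i < n; i++)
                        if (fe[i].len >= CLUSTER_SIZE)
                                return &fe[i];
                return NULL;
        }

        int main(void)
        {
                /* Fragmented free space: small holes, no 2MiB run left. */
                struct free_extent fe[] = {
                        {       0, 128 * 1024 },
                        { 1 << 20, 512 * 1024 },
                        { 8 << 20, 768 * 1024 },
                };
                int n = sizeof(fe) / sizeof(fe[0]);

                printf("tetris:  %s\n", find_tetris(fe, n, 64 * 1024) ?
                       "reuses an existing hole" : "no fit");
                printf("cluster: %s\n", find_cluster(fe, n) ?
                       "found a cluster" : "no fit -> allocate a new chunk");
                return 0;
        }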
    
    The idea behind this is (commit 536ac8ae): "The [...] clusters make it
    more likely a given IO will completely overwrite the ssd block, so it
    doesn't have to do an internal rwm cycle."; 'ssd block' here means a
    nand erase block, and 'rwm' a read-modify-write cycle. So, effectively
    this means applying a "locality based algorithm" and trying to
    outsmart the actual ssd.
    
    Since then, various changes have been made to the involved code, but the
    basic idea is still present, and gets activated whenever the ssd mount
    option is active. This also happens by default, when the rotational flag
    as seen at /sys/block/<device>/queue/rotational is set to 0.
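
    A minimal userspace C sketch of reading that attribute; the device
    name 'sda' is only an example. The kernel itself looks at the same
    information through the block device's request queue rather than
    through sysfs when it decides to turn on ssd mode automatically.

        #include <stdio.h>

        int main(void)
        {
                /* Example device; substitute the disk backing the filesystem. */
                const char *path = "/sys/block/sda/queue/rotational";
                FILE *f = fopen(path, "r");
                int rotational;

                if (!f) {
                        perror(path);
                        return 1;
                }
                if (fscanf(f, "%d", &rotational) != 1) {
                        fprintf(stderr, "%s: unexpected contents\n", path);
                        fclose(f);
                        return 1;
                }
                fclose(f);

                /* 0 means btrfs will turn on -o ssd by default at mount time. */
                printf("%s: %s\n", path,
                       rotational ? "rotational" : "non-rotational (ssd assumed)");
                return 0;
        }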
    
    However, there are a number of problems with this approach.
    
        First, what the optimization is trying to do is outsmart the ssd by
    assuming there is a relation between the physical address space of the
    block device as seen by btrfs and the actual physical storage of the
    ssd, and then adjusting data placement. However, since the
    introduction of the Flash Translation Layer (FTL), which is part of
    the internal controller of an ssd, these attempts are futile. The use
    of a good quality FTL in consumer ssd products might have been limited
    in 2008, but the situation changed drastically soon after. Today, even
    the flash memory in your automatic cat feeding machine or your
    grandma's wheelchair has a full-featured one.
    
    Second, the behaviour as described above results in the filesystem being
    filled up with badly fragmented free space extents because of relatively
    small pieces of space that are freed up by deletes, but not selected
    again as part of a 'cluster'. Since the algorithm prefers allocating a
    new chunk over going back to tetris mode, the end result is a filesystem
    in which all raw space is allocated, but which is composed of
    underutilized chunks with a 'shotgun blast' pattern of fragmented free
    space. Usually, the next problematic event is the filesystem needing
    to allocate new chunk space for metadata, which fails in spectacular
    ways because no unallocated raw space is left to do so.
    
    Third, the default mount options you get for an ssd ('ssd' mode enabled,
    'discard' not enabled), in combination with spreading out writes over
    the full address space and ignoring freed up space leads to worst case
    behaviour in providing information to the ssd itself, since it will
    never learn that all the free space left behind is actually free.
    There are two ways to let an ssd know that previously written data
    does not have to be preserved: sending explicit signals using discard
    or fstrim, or simply overwriting the space with new data. The worst
    case is the btrfs ssd_spread mount option combined with discard not
    being enabled, since it also minimizes the reuse of free space that
    was previously written to.
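
    For the first of those two options, the explicit signal is what
    fstrim(8) sends: the FITRIM ioctl on a mounted filesystem. A minimal C
    sketch, with the mount point '/mnt/btrfs' assumed purely for
    illustration:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/fs.h>           /* FITRIM, struct fstrim_range */

        int main(void)
        {
                struct fstrim_range range = {
                        .start  = 0,
                        .len    = (__u64)-1,    /* whole filesystem */
                        .minlen = 0,
                };
                /* Assumed mount point; any directory on the filesystem works. */
                int fd = open("/mnt/btrfs", O_RDONLY | O_DIRECTORY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (ioctl(fd, FITRIM, &range) < 0) {
                        perror("FITRIM");
                        close(fd);
                        return 1;
                }
                /* The kernel reports back how many bytes it actually trimmed. */
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);
                close(fd);
                return 0;
        }

    The second option, simply overwriting freed space with new data, is
    exactly what the tetris allocation pattern provides for free.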
    
    Fourth, the rotational flag in /sys/ does not reliably indicate
    whether the device is a locally attached ssd. For example, iSCSI or
    NBD devices show up as non-rotational, while a loop device on top of
    an ssd shows up as rotational.
    
    The combination of the second and third problem effectively means that
    despite all the good intentions, the btrfs ssd mode reliably causes the
    ssd hardware and the filesystem structures and performance to be choked
    to death. The clickbait version of the title of this story would have
    been "Btrfs ssd optimizations considered harmful for ssds".
    
    The current nossd 'tetris' mode (even still without discard) allows a
    pattern of overwriting much more previously used space, causing many
    more implicit discards to happen because of the overwrite information
    the ssd gets. The actual location in the physical address space, as
    seen from the point of view of btrfs, is irrelevant, because the
    actual writes to the low-level flash are reordered anyway thanks to
    the FTL.
    
        Changes made in the code
    
    1. Make ssd mode data allocation identical to tetris mode, like nossd.
    2. Adjust and clean up filesystem mount messages so that we can easily
    identify whether a kernel has this patch applied when providing
    support to end users. Also, make better use of the *_and_info helpers
    to only trigger messages on actual state changes (a sketch of that
    pattern follows below).
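
    A hypothetical userspace illustration of that 'log only on real
    transitions' pattern; these are stand-in names, not the actual btrfs
    helpers:

        #include <stdio.h>

        #define OPT_SSD (1u << 0)

        struct mount_opts {
                unsigned int flags;
        };

        /* Stand-in for a *_and_info style helper: set the option bit and
         * print an informational message, but only when the state
         * actually changes. */
        static void set_and_info(struct mount_opts *o, unsigned int opt,
                                 const char *msg)
        {
                if (!(o->flags & opt)) {
                        o->flags |= opt;
                        printf("info: %s\n", msg);
                }
        }

        int main(void)
        {
                struct mount_opts o = { 0 };

                set_and_info(&o, OPT_SSD, "enabling ssd optimizations");
                /* Second call is silent: nothing changed, so no message. */
                set_and_info(&o, OPT_SSD, "enabling ssd optimizations");
                return 0;
        }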
    
        Backporting notes
    
    Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
    * First apply commit 951e7966 "btrfs: drop the nossd flag when
      remounting with -o ssd", or fixup the differences manually.
    * The rest of the conflicts are because of the fs_info refactoring. So,
      for example, instead of using fs_info, it's root->fs_info in
      extent-tree.c
    Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
    Signed-off-by: David Sterba <dsterba@suse.com>