    xattr: use rbtree for simple_xattrs · 3b4c7bc0
    Christian Brauner authored
    A while ago Vasily reported that it is possible to set a large number of
    xattrs on inodes of filesystems that make use of the simple xattr
    infrastructure. This includes tmpfs and all kernfs-based filesystems
    that support xattrs (e.g., cgroupfs). Both cgroupfs and tmpfs can be
    mounted by unprivileged users in unprivileged containers. Root in an
    unprivileged container can set an unrestricted number of security.*
    xattrs, and privileged users can also set an unlimited number of
    trusted.* xattrs. As there are apparently users that have a fairly large
    number of xattrs, we should scale a bit better. Other xattrs such as
    user.* are restricted to a fairly limited number for kernfs-based
    instances.
    
    A simple linked list protected by a spinlock that is taken for set, get,
    and list operations doesn't scale well if users set a lot of xattrs,
    even if it's not a crazy number. There's no need to bring in the big
    guns like rhashtables or rw semaphores for this. An rbtree with a
    rwlock, or limited RCU semantics and a seqlock, is enough.
    
    It scales within the constraints we are working in. By far the most
    common operation is getting an xattr. Setting xattrs should be a
    moderately rare operation. And listxattr() often only happens when
    copying xattrs between files or together with the contents to a new
    file. Holding a lock across listxattr() is unproblematic because it
    doesn't list the values of xattrs. It can only be used to list the names
    of all xattrs set on a file. And the number of xattr names that can be
    listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes. If
    a larger buffer is passed then vfs_listxattr() caps it to XATTR_LIST_MAX
    and if more xattr names are found it will return -E2BIG. In short, the
    maximum amount of memory that can be retrieved via listxattr() is
    limited.
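
    The capping behaviour described above can be modelled in userspace
    roughly as follows. This is a simplified sketch, not the kernel
    implementation: demo_listxattr() and DEMO_XATTR_LIST_MAX are
    hypothetical names, and the name list is passed in as a plain array
    rather than walked from an inode. The -ERANGE return for a buffer that
    is too small but below the cap, and the size query with a zero-sized
    buffer, are assumptions based on the listxattr(2) interface.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_XATTR_LIST_MAX 65536 /* same value as XATTR_LIST_MAX */

/*
 * Hypothetical model of the listxattr() size handling.
 * Names are written to buf as consecutive NUL-terminated strings.
 * Returns the number of bytes needed or written, or a negative errno.
 */
static long demo_listxattr(const char **names, size_t count,
			   char *buf, size_t size)
{
	size_t needed = 0;
	size_t off = 0;
	size_t i;

	for (i = 0; i < count; i++)
		needed += strlen(names[i]) + 1;

	/* A zero-sized buffer only queries the required size. */
	if (size == 0)
		return (long)needed;

	/* Larger buffers are capped to the maximum list size. */
	if (size > DEMO_XATTR_LIST_MAX)
		size = DEMO_XATTR_LIST_MAX;

	if (needed > size)
		return size >= DEMO_XATTR_LIST_MAX ? -E2BIG : -ERANGE;

	/* Only names are copied, never values. */
	for (i = 0; i < count; i++) {
		size_t len = strlen(names[i]) + 1;

		memcpy(buf + off, names[i], len);
		off += len;
	}
	return (long)needed;
}
```

    Because only names are copied, the work done under the lock is bounded
    by the cap regardless of how large the xattr values are.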
    
    Of course, the API is broken, as already documented in xattr(7). In the
    future we might want to address this, but for now this is the world we
    live in and have lived in for a long time. It does indeed mean that
    once an application exceeds the XATTR_LIST_MAX limit for xattr names
    set on an inode, it isn't possible to copy the file and include its
    xattrs in the copy unless the caller knows all xattrs or limits the
    copy to the important ones it knows by name. This holds at least for
    tmpfs and kernfs-based filesystems; other filesystems might provide
    ways of achieving this.
    
    A bonus of this port to rbtree+rwlock is that we shrink the memory
    consumption for users of the simple xattr infrastructure.
    
    Also add proper kernel documentation to all the functions.
    A big thanks to Paul for his comments.
    
    Cc: Vasily Averin <vvs@openvz.org>
    Cc: "Paul E. McKenney" <paulmck@kernel.org>
    Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
    Acked-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>