- 23 Sep, 2016 5 commits
-
-
Eric W. Biederman authored
From: Andrey Vagin <avagin@openvz.org> Each namespace has an owning user namespace and now there is not way to discover these relationships. Pid and user namepaces are hierarchical. There is no way to discover parent-child relationships too. Why we may want to know relationships between namespaces? One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? One more use-case (which usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces. There [1] was a discussion about which interface to choose to determing relationships between namespaces. Eric suggested to add two ioctl-s [2]: > Grumble, Grumble. I think this may actually a case for creating ioctls > for these two cases. Now that random nsfs file descriptors are bind > mountable the original reason for using proc files is not as pressing. > > One ioctl for the user namespace that owns a file descriptor. > One ioctl for the parent namespace of a namespace file descriptor. Here is an implementaions of these ioctl-s. $ man man7/namespaces.7 ... Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is: fd = ioctl(ns_fd, ioctl_type); where ioctl_type is one of the following: NS_GET_USERNS Returns a file descriptor that refers to an owning user names‐ pace. NS_GET_PARENT Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning. In addition to generic ioctl(2) errors, the following specific ones can occur: EINVAL NS_GET_PARENT was called for a nonhierarchical namespace. EPERM The requested namespace is outside of the current namespace scope. [1] https://lkml.org/lkml/2016/7/6/158 [2] https://lkml.org/lkml/2016/7/9/101 Changes for v2: * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing outside of the init namespace, so we can return EPERM in this case too. > The fewer special cases the easier the code is to get > correct, and the easier it is to read. // Eric Changes for v3: * rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: "W. Trevor King" <wking@tremily.us> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@canonical.com>
-
Andrey Vagin authored
There are two new ioctl-s: One ioctl for the user namespace that owns a file descriptor. One ioctl for the parent namespace of a namespace file descriptor. The test checks that these ioctl-s works and that they handle a case when a target namespace is outside of the current process namespace. Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-
Andrey Vagin authored
Pid and user namepaces are hierarchical. There is no way to discover parent-child relationships. In a future we will use this interface to dump and restore nested namespaces. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-
Andrey Vagin authored
Each namespace has an owning user namespace and now there is not way to discover these relationships. Understending namespaces relationships allows to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? After a long discussion, Eric W. Biederman proposed to use ioctl-s for this purpose. The NS_GET_USERNS ioctl returns a file descriptor to an owning user namespace. It returns EPERM if a target namespace is outside of a current user namespace. v2: rename parent to relative v3: Add a missing mntput when returning -EAGAIN --EWB Acked-by: Serge Hallyn <serge@hallyn.com> Link: https://lkml.org/lkml/2016/7/6/158Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-
Andrey Vagin authored
Return -EPERM if an owning user namespace is outside of a process current user namespace. v2: In a first version ns_get_owner returned ENOENT for init_user_ns. This special cases was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
-
- 22 Sep, 2016 8 commits
-
-
Eric W. Biederman authored
In 99.99% of the cases only root in a user namespace can mount /dev/pts and in those cases the owner of /dev/pts/ptmx will remain root.root In the oddball case where someone else has CAP_SYS_ADMIN this code modifies the /dev/pts mount code to use current_fsuid and current_fsgid as the values to use when creating the /dev/ptmx inode. As is done when any other file is created. This is a code simplification, and it allows running without a root user entirely. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
devpts does not and never will have anything to sync so don't bother calling sync_filesystems on remount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Now that all of the work of setting up a superblock has been moved to devpts_fill_super simplify devpts_mount by calling mount_nodev instead of rolling mount_nodev by hand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
The code makes more sense here and things are just clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
The current error codes returned when a the per user per user namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it we made clear that those were the wrong error code, but a correct effor code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
- 31 Aug, 2016 1 commit
-
-
Eric W. Biederman authored
v2: Fixed the very obvious lack of setting ucounts on struct mnt_ns reported by Andrei Vagin, and the kbuild test report. Reported-by: Andrei Vagin <avagin@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
- 08 Aug, 2016 12 commits
-
-
Eric W. Biederman authored
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
The same kind of recursive sane default limit and policy countrol that has been implemented for the user namespace is desirable for the other namespaces, so generalize the user namespace refernce count into a ucount. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Add a structure that is per user and per user ns and use it to hold the count of user namespaces. This makes prevents one user from creating denying service to another user by creating the maximum number of user namespaces. Rename the sysctl export of the maximum count from /proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces to reflect that the count is now per user. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Export the export the maximum number of user namespaces as /proc/sys/userns/max_user_namespaces. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Limit per userns sysctls to only be opened for write by a holder of CAP_SYS_RESOURCE. Add all of the necessary boilerplate for having per user namespace sysctls. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Add the necessary boiler plate to move freeing of user namespaces into work queue and thus into process context where things can sleep. This is a necessary precursor to per user namespace sysctls. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Eric W. Biederman authored
Passing nsproxy into sysctl_table_root.lookup was a premature optimization in attempt to avoid depending on current. The directory /proc/self/sys has not appeared and if and when it does this code will need to be reviewed closely and reworked anyway. So remove the premature optimization. Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-
Linus Torvalds authored
-
- 07 Aug, 2016 10 commits
-
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull more block fixes from Jens Axboe: "As mentioned in the pull the other day, a few more fixes for this round, all related to the bio op changes in this series. Two fixes, and then a cleanup, renaming bio->bi_rw to bio->bi_opf. I wanted to do that change right after or right before -rc1, so that risk of conflict was reduced. I just rebased the series on top of current master, and no new ->bi_rw usage has snuck in" * 'for-linus' of git://git.kernel.dk/linux-block: block: rename bio bi_rw to bi_opf target: iblock_execute_sync_cache() should use bio_set_op_attrs() mm: make __swap_writepage() use bio_set_op_attrs() block/mm: make bdev_ops->rw_page() take a bool for read/write
-
git://people.freedesktop.org/~airlied/linuxLinus Torvalds authored
Pull drm zpos property support from Dave Airlie: "This tree was waiting on some media stuff I hadn't had time to get a stable branchpoint off, so I just waited until it was all in your tree first. It's been around a bit on the list and shouldn't affect anything outside adding the generic API and moving some ARM drivers to using it" * tag 'drm-for-v4.8-zpos' of git://people.freedesktop.org/~airlied/linux: drm: rcar: use generic code for managing zpos plane property drm/exynos: use generic code for managing zpos plane property drm: sti: use generic zpos for plane drm: add generic zpos property
-
Jens Axboe authored
Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
The original commit missed this function, it needs to mark it a write flush. Cc: Mike Christie <mchristi@redhat.com> Fixes: e742fc32 ("target: use bio op accessors") Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
Cleaner than manipulating bio->bi_rw flags directly. Signed-off-by: Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
Commit abf54548 changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
-
git://git.lwn.net/linuxLinus Torvalds authored
Pull documentation fixes from Jonathan Corbet: "Three fixes for the docs build, including removing an annoying warning on 'make help' if sphinx isn't present" * tag 'doc-4.8-fixes' of git://git.lwn.net/linux: DocBook: use DOCBOOKS="" to ignore DocBooks instead of IGNORE_DOCBOOKS=1 Documenation: update cgroup's document path Documentation/sphinx: do not warn about missing tools in 'make help'
-
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_miscLinus Torvalds authored
Pull binfmt_misc update from James Bottomley: "This update is to allow architecture emulation containers to function such that the emulation binary can be housed outside the container itself. The container and fs parts both have acks from relevant experts. To use the new feature you have to add an F option to your binfmt_misc configuration" From the docs: "The usual behaviour of binfmt_misc is to spawn the binary lazily when the misc format file is invoked. However, this doesn't work very well in the face of mount namespaces and changeroots, so the F mode opens the binary as soon as the emulation is installed and uses the opened image to spawn the emulator, meaning it is always available once installed, regardless of how the environment changes" * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc: binfmt_misc: add F option description to documentation binfmt_misc: add persistent opened binary handler for containers fs: add filp_clone_open API
-
Eryu Guan authored
In most cases, EPERM is returned on immutable inode, and there're only a few places returning EACCES. I noticed this when running LTP on overlayfs, setxattr03 failed due to unexpected EACCES on immutable inode. So converting all EACCES to EPERM on immutable inode. Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull more vfs updates from Al Viro: "Assorted cleanups and fixes. In the "trivial API change" department - ->d_compare() losing 'parent' argument" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cachefiles: Fix race between inactivating and culling a cache object 9p: use clone_fid() 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()" vfs: make dentry_needs_remove_privs() internal vfs: remove file_needs_remove_privs() vfs: fix deadlock in file_remove_privs() on overlayfs get rid of 'parent' argument of ->d_compare() cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare() affs ->d_compare(): don't bother with ->d_inode fold _d_rehash() and __d_rehash() together fold dentry_rcuwalk_invalidate() into its only remaining caller
-
- 06 Aug, 2016 4 commits
-
-
Linus Torvalds authored
Merge tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull more xfs updates from Dave Chinner: "This is the second part of the XFS updates for this merge cycle, and contains the new reverse block mapping feature for XFS. Reverse mapping allows us to track the owner of a specific block on disk precisely. It is implemented as a set of btrees (one per allocation group) that track the owners of allocated extents. Effectively it is a "used space tree" that is updated when we allocate or free extents. i.e. it is coherent with the free space btrees we already maintain and never overlaps with them. This reverse mapping infrastructure is the building block of several upcoming features - reflink, copy-on-write data, dedupe, online metadata and data scrubbing, highly accurate bad sector/data loss reporting to users, and significantly improved reconstruction of damaged and corrupted filesystems. There's a lot of new stuff coming along in the next couple of cycles,a nd it all builds in the rmap infrastructure. As such, it's a huge chunk of new code with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace suport for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time kernel with this code in it is released. The new rmap enabled code regresses 3 xfstests - all are ENOSPC related corner cases, one of which Darrick posted a fix for a few hours ago. The other two are fixed by infrastructure that is part of the upcoming reflink patchset. This new ENOSPC infrastructure requires a on-disk format tweak required to keep mount times in check - we need to keep an on-disk count of allocated rmapbt blocks so we don't have to scan the entire btrees at mount time to count them. This is currently being tested and will be part of the fixes sent in the next week or two so users will not be exposed to this change" * tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits) xfs: move (and rename) the deferred bmap-free tracepoints xfs: collapse single use static functions xfs: remove unnecessary parentheses from log redo item recovery functions xfs: remove the extents array from the rmap update done log item xfs: in btree_lshift, only allocate temporary cursor when needed xfs: remove unnecesary lshift/rshift key initialization xfs: remove the get*keys and update_keys btree ops pointers xfs: enable the rmap btree functionality xfs: don't update rmapbt when fixing agfl xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled xfs: add rmap btree block detection to log recovery xfs: add rmap btree geometry feature flag xfs: propagate bmap updates to rmapbt xfs: enable the xfs_defer mechanism to process rmaps to update xfs: log rmap intent items xfs: create rmap update intent log items xfs: add rmap btree insert and delete helpers xfs: convert unwritten status of reverse mappings xfs: remove an extent from the rmap btree xfs: add an extent to the rmap btree ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds authored
Pull qstr constification updates from Al Viro: "Fairly self-contained bunch - surprising lot of places passes struct qstr * as an argument when const struct qstr * would suffice; it complicates analysis for no good reason. I'd prefer to feed that separately from the assorted fixes (those are in #for-linus and with somewhat trickier topology)" * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: qstr: constify instances in adfs qstr: constify instances in lustre qstr: constify instances in f2fs qstr: constify instances in ext2 qstr: constify instances in vfat qstr: constify instances in procfs qstr: constify instances in fuse qstr constify instances in fs/dcache.c qstr: constify instances in nfs qstr: constify instances in ocfs2 qstr: constify instances in autofs4 qstr: constify instances in hfs qstr: constify instances in hfsplus qstr: constify instances in logfs qstr: constify dentry_init_security
-
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-mediaLinus Torvalds authored
Pull mailcap fixlets from Mauro Carvalho Chehab: "A small fixup for my and Shuah's entries in .mailcap. Basically, those entries were with a syntax that makes get_maintainer.pl to do the wrong thing" * tag 'media/v4.8-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: .mailmap: Correct entries for Mauro Carvalho Chehab and Shuah Khan
-
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds authored
Pull virtio/vhost updates from Michael Tsirkin: - new vsock device support in host and guest - platform IOMMU support in host and guest, including compatibility quirks for legacy systems. - misc fixes and cleanups. * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: VSOCK: Use kvfree() vhost: split out vringh Kconfig vhost: detect 32 bit integer wrap around vhost: new device IOTLB API vhost: drop vringh dependency vhost: convert pre sorted vhost memory array to interval tree vhost: introduce vhost memory accessors VSOCK: Add Makefile and Kconfig VSOCK: Introduce vhost_vsock.ko VSOCK: Introduce virtio_transport.ko VSOCK: Introduce virtio_vsock_common.ko VSOCK: defer sock removal to transports VSOCK: transport-specific vsock_transport functions vhost: drop vringh dependency vop: pull in vhost Kconfig virtio: new feature to detect IOMMU device quirk balloon: check the number of available pages in leak balloon vhost: lockless enqueuing vhost: simplify work flushing
-