1. 19 Dec, 2016 1 commit
    • block: allow WRITE_SAME commands with the SG_IO ioctl · 25cdb645
      Mauricio Faria de Oliveira authored
      The WRITE_SAME commands are not present in the blk_default_cmd_filter
      write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
      is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
      [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]

      The problem can be reproduced with the sg_write_same command:
      
        # sg_write_same --num 1 --xferlen 512 /dev/sda
        #
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
          Write same: pass through os error: Operation not permitted
        #
      
      For comparison, the WRITE_VERIFY command does not observe this problem,
      since it is in that list:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
        #
      
      So, this patch adds the WRITE_SAME commands to the list, in order
      for the SG_IO ioctl to finish successfully:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
        #
      
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
      which employ the SG_IO ioctl() and run as an unprivileged user (libvirt-qemu).
      
      In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
      which are translated to write-same calls in the guest kernel, and then into
      SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
      
        [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
        [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
        [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
        [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
        [...] blk_update_request: I/O error, dev sda, sector 17096824
      
      Links:
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com>
      Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
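The filtering described in the commit above can be pictured with a small model. This is a sketch, not the kernel code: `CmdFilter`, `default_filter` and the `allow_write_same` switch are hypothetical names, and only the 10- and 16-byte WRITE SAME opcodes are modelled, which is an assumption about the variants the patch covers. It shows why an opcode missing from the write list fails with -EPERM for a caller without CAP_SYS_RAWIO, while WRITE_VERIFY passes.

```python
# SCSI opcodes; the filter layout itself is a simplified, hypothetical model.
WRITE_VERIFY = 0x2E
WRITE_SAME = 0x41      # matches the "41 ..." CDB in the guest log
WRITE_SAME_16 = 0x93

EPERM = 1


class CmdFilter:
    """Toy per-opcode allow-list consulted for unprivileged callers."""

    def __init__(self):
        self.write_ok = set()

    def verify(self, opcode, has_rawio):
        # Privileged callers bypass the list; others need an explicit entry.
        if has_rawio or opcode in self.write_ok:
            return 0
        return -EPERM


def default_filter(allow_write_same):
    """Build the default list, with or without the WRITE SAME entries."""
    f = CmdFilter()
    f.write_ok.add(WRITE_VERIFY)
    if allow_write_same:
        f.write_ok.update({WRITE_SAME, WRITE_SAME_16})
    return f
```

With `default_filter(False)` an unprivileged WRITE_SAME is rejected, mirroring the first `capsh` run; with `default_filter(True)` it is accepted, mirroring the run after the patch.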
  2. 17 Dec, 2016 3 commits
  3. 16 Dec, 2016 36 commits
    • Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client · 59331c21
      Linus Torvalds authored
      Pull ceph updates from Ilya Dryomov:
       "A varied set of changes:
      
         - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
           (myself). Also fixed a deadlock caused by a bogus allocation on the
           writeback path and authorize reply verification.
      
         - a fix for long stalls during fsync (Jeff Layton). The client now
           has a way to force the MDS log flush, leading to ~100x speedups in
           some synthetic tests.
      
         - a new [no]require_active_mds mount option (Zheng Yan).
      
           On mount, we will now check whether any of the MDSes are available
           and bail rather than block if none are. This check can be avoided
           by specifying the "no" option.
      
         - a couple of MDS cap handling fixes and a few assorted patches
           throughout"
      
      * tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
        libceph: remove now unused finish_request() wrapper
        libceph: always signal completion when done
        ceph: avoid creating orphan object when checking pool permission
        ceph: properly set issue_seq for cap release
        ceph: add flags parameter to send_cap_msg
        ceph: update cap message struct version to 10
        ceph: define new argument structure for send_cap_msg
        ceph: move xattr initialzation before the encoding past the ceph_mds_caps
        ceph: fix minor typo in unsafe_request_wait
        ceph: record truncate size/seq for snap data writeback
        ceph: check availability of mds cluster on mount
        ceph: fix splice read for no Fc capability case
        ceph: try getting buffer capability for readahead/fadvise
        ceph: fix scheduler warning due to nested blocking
        ceph: fix printing wrong return variable in ceph_direct_read_write()
        crush: include mapper.h in mapper.c
        rbd: silence bogus -Wmaybe-uninitialized warning
        libceph: no need to drop con->mutex for ->get_authorizer()
        libceph: drop len argument of *verify_authorizer_reply()
        libceph: verify authorize reply on connect
        ...
    • Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs · ff0f962c
      Linus Torvalds authored
      Pull overlayfs updates from Miklos Szeredi:
       "This update contains:
      
         - try to clone on copy-up
      
         - allow renaming a directory
      
         - split source into manageable chunks
      
         - misc cleanups and fixes
      
        It does not contain the read-only fd data inconsistency fix, which Al
        didn't like. I'll leave that to the next year..."
      
      * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
        ovl: fix reStructuredText syntax errors in documentation
        ovl: fix return value of ovl_fill_super
        ovl: clean up kstat usage
        ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
        ovl: create directories inside merged parent opaque
        ovl: opaque cleanup
        ovl: show redirect_dir mount option
        ovl: allow setting max size of redirect
        ovl: allow redirect_dir to default to "on"
        ovl: check for emptiness of redirect dir
        ovl: redirect on rename-dir
        ovl: lookup redirects
        ovl: consolidate lookup for underlying layers
        ovl: fix nested overlayfs mount
        ovl: check namelen
        ovl: split super.c
        ovl: use d_is_dir()
        ovl: simplify lookup
        ovl: check lower existence of rename target
        ovl: rename: simplify handling of lower/merged directory
        ...
    • Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 087a76d3
      Linus Torvalds authored
      Pull btrfs updates from Chris Mason:
       "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
        here, and Christoph pitched in corrections/improvements to make btrfs
        use proper helpers for bio walking instead of doing it by hand.
      
        There are some key fixes as well, including some long standing bugs
        that took forever to track down in btrfs_drop_extents and during
        balance"
      
      * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
        btrfs: limit async_work allocation and worker func duration
        Revert "Btrfs: adjust len of writes if following a preallocated extent"
        Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
        btrfs: opencode chunk locking, remove helpers
        btrfs: remove root parameter from transaction commit/end routines
        btrfs: split btrfs_wait_marked_extents into normal and tree log functions
        btrfs: take an fs_info directly when the root is not used otherwise
        btrfs: simplify btrfs_wait_cache_io prototype
        btrfs: convert extent-tree tracepoints to use fs_info
        btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
        btrfs: root->fs_info cleanup, add fs_info convenience variables
        btrfs: root->fs_info cleanup, update_block_group{,flags}
        btrfs: root->fs_info cleanup, lock/unlock_chunks
        btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
        btrfs: pull node/sector/stripe sizes out of root and into fs_info
        btrfs: root->fs_info cleanup, io_ctl_init
        btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
        btrfs: struct reada_control.root -> reada_control.fs_info
        btrfs: struct btrfsic_state->root should be an fs_info
        btrfs: alloc_reserved_file_extent trace point should use extent_root
        ...
    • Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux · 759b2656
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "The one new feature is support for a new NFSv4.2 mode_umask attribute
        that makes ACL inheritance a little more useful in environments that
        default to restrictive umasks. Requires client-side support, also on
        its way for 4.10.
      
        Other than that, miscellaneous smaller fixes and cleanup, especially
        to the server rdma code"
      
      [ The client side of the umask attribute was merged yesterday ]
      
      * tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
        nfsd: add support for the umask attribute
        sunrpc: use DEFINE_SPINLOCK()
        svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
        svcrdma: Break up dprintk format in svc_rdma_accept()
        svcrdma: Remove unused variable in rdma_copy_tail()
        svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
        svcrdma: Remove svc_rdma_op_ctxt::wc_status
        svcrdma: Remove DMA map accounting
        svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
        svcrdma: Renovate sendto chunk list parsing
        svcauth_gss: Close connection when dropping an incoming message
        svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
        nfsd: constify reply_cache_stats_operations structure
        nfsd: update workqueue creation
        sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
        nfsd: catch errors in decode_fattr earlier
        nfsd: clean up supported attribute handling
        nfsd: fix error handling for clients that fail to return the layout
        nfsd: more robust allocation failure handling in nfsd_reply_cache_init
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 9a19a6db
      Linus Torvalds authored
      Pull vfs updates from Al Viro:
      
       - more ->d_init() stuff (work.dcache)
      
       - pathname resolution cleanups (work.namei)
      
       - a few missing iov_iter primitives - copy_from_iter_full() and
         friends. Either copy the full requested amount, advance the iterator
         and return true, or fail, return false and do _not_ advance the
         iterator. Quite a few open-coded callers converted (and became more
         readable and harder to fuck up that way) (work.iov_iter)
      
       - several assorted patches, the big one being logfs removal
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        logfs: remove from tree
        vfs: fix put_compat_statfs64() does not handle errors
        namei: fold should_follow_link() with the step into not-followed link
        namei: pass both WALK_GET and WALK_MORE to should_follow_link()
        namei: invert WALK_PUT logics
        namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
        namei: saner calling conventions for mountpoint_last()
        namei.c: get rid of user_path_parent()
        switch getfrag callbacks to ..._full() primitives
        make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
        [iov_iter] new primitives - copy_from_iter_full() and friends
        don't open-code file_inode()
        ceph: switch to use of ->d_init()
        ceph: unify dentry_operations instances
        lustre: switch to use of ->d_init()
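The all-or-nothing contract of the `..._full()` primitives mentioned in the vfs pull can be illustrated with a toy iterator. This is only a sketch: `ByteIter` is a hypothetical stand-in for the kernel's iterator, and the only failure it models is a source that is too short, whereas the real primitives can also fail on a faulting copy.

```python
class ByteIter:
    """Toy iterator over a bytes source (not the kernel's struct iov_iter)."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def remaining(self):
        return len(self.data) - self.pos


def copy_from_iter(dst, n, it):
    """Partial-copy style: copy up to n bytes, advance by what was copied."""
    chunk = it.data[it.pos:it.pos + n]
    dst.extend(chunk)
    it.pos += len(chunk)
    return len(chunk)


def copy_from_iter_full(dst, n, it):
    """All-or-nothing style: copy exactly n bytes and advance, or do nothing."""
    if it.remaining() < n:
        return False  # the iterator is left where it was
    dst.extend(it.data[it.pos:it.pos + n])
    it.pos += n
    return True
```

A caller of the second form needs a single boolean check, whereas a caller of the first form has to compare the returned count with `n` and decide what to do about an iterator that has already moved.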
    • Merge tag 'media/v4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · bd9999cd
      Linus Torvalds authored
      Pull media updates from Mauro Carvalho Chehab:
      
       - new Mediatek drivers: mtk-mdp and mtk-vcodec
      
       - some additions to the media documentation

       - the CEC core and drivers were promoted from staging to mainstream

       - some cleanups in the DVB core

       - the LIRC serial driver got promoted from staging to mainstream

       - added a driver for the Renesas R-Car FDP1
      
       - add DVBv5 statistics support to mn88473 driver
      
       - several fixes related to printk continuation lines
      
       - add support for HSV encoding formats
      
       - lots of other cleanups, fixups and driver improvements.
      
      * tag 'media/v4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (496 commits)
        [media] v4l: tvp5150: Add missing break in set control handler
        [media] v4l: tvp5150: Don't inline the tvp5150_selmux() function
        [media] v4l: tvp5150: Compile tvp5150_link_setup out if !CONFIG_MEDIA_CONTROLLER
        [media] em28xx: don't store usb_device at struct em28xx
        [media] em28xx: use usb_interface for dev_foo() calls
        [media] em28xx: don't change the device's name
        [media] mn88472: fix chip id check on probe
        [media] mn88473: fix chip id check on probe
        [media] lirc: fix error paths in lirc_cdev_add()
        [media] s5p-mfc: Add support for MFC v8 available in Exynos 5433 SoCs
        [media] s5p-mfc: Rework clock handling
        [media] s5p-mfc: Don't keep clock prepared all the time
        [media] s5p-mfc: Kill all IS_ERR_OR_NULL in clocks management code
        [media] s5p-mfc: Remove dead conditional code
        [media] s5p-mfc: Ensure that clock is disabled before turning power off
        [media] s5p-mfc: Remove special clock rate management
        [media] s5p-mfc: Use printk_ratelimited for reporting ioctl errors
        [media] s5p-mfc: Set DMA_ATTR_ALLOC_SINGLE_PAGES
        [media] vivid: Set color_enc on HSV formats
        [media] v4l2-tpg: Init hv_enc field with a valid value
        ...
    • Merge tag 'edac/v4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac · 9dfe495c
      Linus Torvalds authored
      Pull edac updates from Mauro Carvalho Chehab:
       "This contains the conversion of the EDAC uAPI documentation to ReST
        and the addition of the EDAC kAPI documentation to the driver-api
        docs.
      
        It also splits the EDAC headers by their functions"
      
      * tag 'edac/v4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
        EDAC: Document HW_EVENT_ERR_DEFERRED type
        edac.rst: move concepts dictionary from edac.h
        edac: fix kenel-doc markups at edac.h
        edac: fix kernel-doc tags at the drivers/edac_*.h
        edac: adjust docs location at MAINTAINERS and 00-INDEX
        driver-api: create an edac.rst file with EDAC documentation
        edac: move documentation from edac_mc.c to edac_core.h
        edac: move documentation from edac_pci*.c to edac_pci.h
        edac: move documentation from edac_device to edac_core.h
        edac: rename edac_core.h to edac_mc.h
        edac: move EDAC device definitions to drivers/edac/edac_device.h
        edac: move EDAC PCI definitions to drivers/edac/edac_pci.h
        docs-rst: admin-guide: add documentation for EDAC
        edac.txt: Improve documentation, adding RAS introduction
        edac.txt: update information about newer Intel CPUs
        edac.txt: remove info that the Nehalem EDAC is experimental
        edac.txt: convert EDAC documentation to ReST
        edac.txt: add a section explaining the dimmX and rankX directories
        edac: edac_core.h: remove prototype for edac_pci_reset_delay_period()
        edac: edac_core.h: get rid of unused kobj_complete
    • Merge branch 'for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · 9936f44a
      Linus Torvalds authored
      Pull UML update from Richard Weinberger:
       "A performance enhancement for UML's block driver"
      
      * 'for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: UBD Improvements
    • Merge tag 'nios2-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · 70f56cbb
      Linus Torvalds authored
      Pull arch/nios2 updates from Ley Foon Tan:
      
       - add screen_info
      
       - Convert pfn_valid to static inline
      
       - Extend !__ASSEMBLY__ section in asm/page.h
      
      * tag 'nios2-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: add screen_info
        nios2: Convert pfn_valid to static inline
        nios2: Extend !__ASSEMBLY__ section in asm/page.h
    • Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · de399813
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
       "Highlights include:
      
         - Support for the kexec_file_load() syscall, which is a prereq for
           secure and trusted boot.
      
         - Prevent kernel execution of userspace on P9 Radix (similar to
           SMEP/PXN).
      
         - Sort the exception tables at build time, to save time at boot, and
           store them as relative offsets to save space in the kernel image &
           memory.
      
         - Allow building the kernel with thin archives, which should allow us
           to build an allyesconfig once some other fixes land.
      
         - Build fixes to allow us to correctly rebuild when changing the
           kernel endian from big to little or vice versa.
      
         - Plumbing so that we can avoid doing a full mm TLB flush on P9
           Radix.
      
         - Initial stack protector support (-fstack-protector).
      
         - Support for dumping the radix (aka. Linux) and hash page tables via
           debugfs.
      
         - Fix an oops in cxl coredump generation when cxl_get_fd() is used.
      
         - Freescale updates from Scott: "Highlights include 8xx hugepage
           support, qbman fixes/cleanup, device tree updates, and some misc
           cleanup."
      
         - Many and varied fixes and minor enhancements as always.
      
        Thanks to:
          Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
          Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
          Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
          Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
          Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
          Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
          Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
          Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
          Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"
      
      [ And thanks to Michael, who took time off from a new baby to get this
        pull request done.   - Linus ]
      
      * tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
        powerpc/fsl/dts: add FMan node for t1042d4rdb
        powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
        powerpc/fsl/dts: add QMan and BMan nodes on t1024
        powerpc/fsl/dts: add QMan and BMan nodes on t1023
        soc/fsl/qman: test: use DEFINE_SPINLOCK()
        powerpc/fsl-lbc: use DEFINE_SPINLOCK()
        powerpc/8xx: Implement support of hugepages
        powerpc: get hugetlbpage handling more generic
        powerpc: port 64 bits pgtable_cache to 32 bits
        powerpc/boot: Request no dynamic linker for boot wrapper
        soc/fsl/bman: Use resource_size instead of computation
        soc/fsl/qe: use builtin_platform_driver
        powerpc/fsl_pmc: use builtin_platform_driver
        powerpc/83xx/suspend: use builtin_platform_driver
        powerpc/ftrace: Fix the comments for ftrace_modify_code
        powerpc/perf: macros for power9 format encoding
        powerpc/perf: power9 raw event format encoding
        powerpc/perf: update attribute_group data structure
        powerpc/perf: factor out the event format field
        powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 57ca04ab
      Linus Torvalds authored
      Pull more s390 updates from Martin Schwidefsky:
       "Over 95% of the changes in this pull request are related to the zcrypt
        driver. There are five improvements for zcrypt: the ID for the CEX6
        cards is added, workload balancing and multi-domain support are
        introduced, the debug logs are overhauled and a set of tracepoints is
        added.
      
        Then there are several patches in regard to inline assemblies. One
        compile fix and several missing memory clobbers. As far as we can tell
        the omitted memory clobbers have not caused any breakage.
      
        A small change to the PCI arch code: the machine can tell us how big
        the function measurement blocks are. The PCI function measurement will
        be disabled for a device if the queried length is larger than the
        allocated size for these blocks.
      
        And two more patches to correct five printk messages.
      
        That is it for s390 in regard to the 4.10 merge window. Happy holidays"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (23 commits)
        s390/pci: query fmb length
        s390/zcrypt: add missing memory clobber to ap_qci inline assembly
        s390/extmem: add missing memory clobber to dcss_set_subcodes
        s390/nmi: fix inline assembly constraints
        s390/lib: add missing memory barriers to string inline assemblies
        s390/cpumf: fix qsi inline assembly
        s390/setup: reword printk messages
        s390/dasd: fix typos in DASD error messages
        s390: fix compile error with memmove_early() inline assembly
        s390/zcrypt: tracepoint definitions for zcrypt device driver.
        s390/zcrypt: Rework debug feature invocations.
        s390/zcrypt: Improved invalid domain response handling.
        s390/zcrypt: Fix ap_max_domain_id for older machine types
        s390/zcrypt: Correct function bits for CEX2x and CEX3x cards.
        s390/zcrypt: Fixed attrition of AP adapters and domains
        s390/zcrypt: Introduce new zcrypt device status API
        s390/zcrypt: add multi domain support
        s390/zcrypt: Introduce workload balancing
        s390/zcrypt: get rid of ap_poll_requests
        s390/zcrypt: header for the AP inline assmblies
        ...
    • ovl: fix reStructuredText syntax errors in documentation · c3c86996
      Amir Goldstein authored
       - Fix broken long line block quote
       - Fix missing newline before bullets list
       - Use correct numbered list syntax
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fix return value of ovl_fill_super · 313684c4
      Geliang Tang authored
      If kcalloc() failed, the return value of ovl_fill_super() is -EINVAL,
      not -ENOMEM. So this patch sets this value to -ENOMEM before calling
      kcalloc(), and sets it back to -EINVAL after calling kcalloc().
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
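The error-code handling described in the commit above follows a common pattern: set the pending error to the value the next step would fail with, then restore the default afterwards. A minimal sketch of that pattern follows; `fill_super`, its `alloc` callback and the final length check are hypothetical illustrations, not the overlayfs code.

```python
import errno


def fill_super(stacklen, alloc):
    """Return 0 or a negative errno, mimicking a C function with a shared error variable."""
    err = -errno.EINVAL      # default for validation failures
    if stacklen <= 0:
        return err

    err = -errno.ENOMEM      # the allocation below can only fail for lack of memory
    stack = alloc(stacklen)
    if stack is None:
        return err

    err = -errno.EINVAL      # restore the default for the checks that follow
    if len(stack) != stacklen:
        return err
    return 0
```

Without the two reassignments around the allocation, a failed allocation would be reported with the validation error code, which is the mismatch the commit message describes.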
    • ovl: clean up kstat usage · 32a3d848
      Al Viro authored
      FWIW, there's a bit of abuse of struct kstat in overlayfs object
      creation paths - for one thing, it ends up with a very small subset
      of struct kstat (mode + rdev), for another it also needs link in
      case of symlinks and ends up passing it separately.
      
      IMO it would be better to introduce a separate object for that.
      
      In principle, we might even lift that thing into general API and switch
       ->mkdir()/->mknod()/->symlink() to identical calling conventions.  Hell
      knows, perhaps ->create() as well...
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fold ovl_copy_up_truncate() into ovl_copy_up() · 9aba6521
      Amir Goldstein authored
      This removes code duplication.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: create directories inside merged parent opaque · 97c684cc
      Amir Goldstein authored
      The benefit of making directories opaque on creation is that lookups can
      stop short when they reach the originally created directory, instead of
      continuing the lookup through the entire depth of the parent directory stack.

      The best case is an overlay with N layers, performing a lookup for a
      first-level directory that exists only in upper.  In that case, there will
      be only one lookup instead of N.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
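The "one lookup instead of N" claim in the commit above can be checked with a toy layer stack. In this sketch each layer is a plain dictionary and `lookup` is a hypothetical helper; it models only the early stop at an opaque directory, not the real overlayfs lookup code.

```python
def lookup(name, layers):
    """Look up `name` layer by layer, upper layer first.

    Each layer maps a name to {"opaque": bool}.
    Returns (indices of layers where the name was found, number of lookups done).
    """
    found, lookups = [], 0
    for index, layer in enumerate(layers):
        lookups += 1
        entry = layer.get(name)
        if entry is None:
            continue
        found.append(index)
        if entry["opaque"]:
            break  # nothing below an opaque directory is merged in
    return found, lookups
```

For a directory that exists only in the upper layer of a four-layer stack, the lookup count is 1 when the directory is opaque and 4 when it is not.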
    • ovl: opaque cleanup · 5cf5b477
      Miklos Szeredi authored
      oe->opaque is set for
      
       a) whiteouts
       b) directories having the "trusted.overlay.opaque" xattr
      
      Case b can be simplified, since setting the xattr always implies setting
      oe->opaque.  Also once set, the opaque flag is never cleared.
      
      Don't need to set opaque flag for non-directories.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: show redirect_dir mount option · c5bef3a7
      Amir Goldstein authored
      Show the value of redirect_dir in /proc/mounts.
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: allow setting max size of redirect · 3ea22a71
      Miklos Szeredi authored
      Add a module option to allow tuning the max size of absolute redirects.
      Default is 256.
      
      Size of relative redirects is naturally limited by the underlying
      filesystem's max filename length (usually 255).
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: allow redirect_dir to default to "on" · 688ea0e5
      Miklos Szeredi authored
      This patch introduces a kernel config option and a module param.  Both can
      be used independently to turn the default value of redirect_dir on or off.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: check for emptiness of redirect dir · d1595119
      Amir Goldstein authored
      Before introducing redirect_dir feature, the condition
      !ovl_lower_positive(dentry) for a directory, implied that it is a pure
      upper directory, which may be removed if empty.
      
      Now that a directory can be redirected, it is possible that upper does not
      cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
      merge (with redirected path) and may be non-empty.
      
      Check for this case in ovl_remove_upper().
      
      This change fixes the following test case from rename-pop-dir.py
      of unionmount-testsuite:
      
          """Remove dir and rename old name"""
          d = ctx.non_empty_dir()
          d2 = ctx.no_dir()
      
          ctx.rmdir(d, err=ENOTEMPTY)
          ctx.rename(d, d2)
          ctx.rmdir(d, err=ENOENT)
          ctx.rmdir(d2, err=ENOTEMPTY)
      
      ./run --ov rename-pop-dir
      /mnt/a/no_dir103: Expected error (Directory not empty) was not produced
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: redirect on rename-dir · a6c60655
      Miklos Szeredi authored
      Current code returns EXDEV when a directory would need to be copied up to
      move.  We could copy up the directory tree in this case, but there's
      another, simpler solution: point to old lower directory from moved upper
      directory.
      
      This is achieved with a "trusted.overlay.redirect" xattr storing the path
      relative to the root of the overlay.  After such attribute has been set,
      the directory can be moved without further actions required.
      
      This is a backward-incompatible feature: old kernels won't be able to
      correctly mount an overlay containing redirected directories.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: lookup redirects · 02b69b28
      Miklos Szeredi authored
      If a directory has the "trusted.overlay.redirect" xattr, it means that the
      value of the xattr should be used to find the underlying directory on the
      next lower layer.
      
      The redirect may be relative or absolute.  Absolute redirects begin with a
      slash.
      
      A relative redirect means: instead of the current dentry's name use the
      value of the redirect to find the directory in the next lower
      layer. Relative redirects must not contain a slash.
      
      An absolute redirect means: look up the directory relative to the root of
      the overlay using the value of the redirect in the next lower layer.
      
      Redirects work on lower layers as well.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
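The relative and absolute redirect rules from the commit above can be written down as a small path function. This is a sketch under stated assumptions: `next_lower_path` is a hypothetical helper, paths are plain strings relative to the overlay root, and raising an error for a relative redirect that contains a slash is one possible way to reflect the "must not contain a slash" rule.

```python
import posixpath


def next_lower_path(parent_path, name, redirect=None):
    """Path to look up in the next lower layer, relative to the overlay root."""
    if redirect is None:
        # No redirect xattr: use the current dentry's own name.
        return posixpath.join(parent_path, name)
    if redirect.startswith("/"):
        # Absolute redirect: the value is the path from the overlay root.
        return posixpath.normpath(redirect)
    if "/" in redirect:
        raise ValueError("relative redirect must not contain a slash")
    # Relative redirect: replace only the last component.
    return posixpath.join(parent_path, redirect)
```

For example, a directory `/a/b` with a relative redirect `old` is looked up as `/a/old` in the next lower layer, while an absolute redirect `/x/old` is looked up as `/x/old` regardless of where the directory now lives.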
    • ovl: consolidate lookup for underlying layers · e28edc46
      Miklos Szeredi authored
      Use a common helper for lookup of upper and lower layers.  This paves the
      way for looking up directory redirects.
      
      No functional change.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • ovl: fix nested overlayfs mount · 48fab5d7
      Amir Goldstein authored
      When the upper overlayfs checks "trusted.overlay.*" xattr on the underlying
      overlayfs mount, it gets -EPERM, which confuses the upper overlayfs.
      
      Fix this by returning -EOPNOTSUPP instead of -EPERM from
      ovl_own_xattr_get() and ovl_own_xattr_set().  This behavior is consistent
      with the behavior of ovl_listxattr(), which filters out the private
      overlayfs xattrs.
      
      Note: nested overlays are deprecated.  But this change makes sense
      regardless: these xattrs are private to the overlay and should always be
      hidden.  Hence getting and setting them should indicate this.
      
      [SzMi: Use EOPNOTSUPP instead of ENODATA and use it for both getting and
      setting "trusted.overlay." xattrs.  This is a perfectly valid error code
      for "we don't support this prefix", which is the case here.]
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      48fab5d7
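The behaviour described in "ovl: fix nested overlayfs mount" can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `get_xattr`, `list_xattr` and the dictionary used as an xattr store are hypothetical stand-ins for the overlayfs handlers.

```python
import errno

# Prefix of the xattrs that are private to the overlay.
OVL_XATTR_PREFIX = "trusted.overlay."


def get_xattr(xattrs, name):
    """Return the xattr value, or a negative errno like a kernel handler."""
    if name.startswith(OVL_XATTR_PREFIX):
        # Private xattrs are hidden: "we don't support this prefix".
        return -errno.EOPNOTSUPP
    if name not in xattrs:
        return -errno.ENODATA
    return xattrs[name]


def list_xattr(xattrs):
    """List xattr names, filtering out the overlay-private ones."""
    return sorted(n for n in xattrs if not n.startswith(OVL_XATTR_PREFIX))
```

With this model, a caller probing `trusted.overlay.opaque` gets `-EOPNOTSUPP` rather than `-EPERM`, which matches the fact that the same name never appears in the listing.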
    • Miklos Szeredi's avatar
      ovl: check namelen · 6b2d5fe4
      Miklos Szeredi authored
      We already calculate f_namelen in statfs as the maximum of the name lengths
      provided by the filesystems taking part in the overlay.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      6b2d5fe4
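A minimal sketch of the name length rule from "ovl: check namelen" follows. The maximum over the participating filesystems comes from the commit message; rejecting longer names with ENAMETOOLONG is an assumption of this sketch, and both function names are hypothetical.

```python
import errno


def overlay_namelen(layer_namelens):
    """f_namelen of the overlay: the maximum over all participating layers."""
    return max(layer_namelens)


def check_name(name, namelen):
    """Return 0 if the name fits, else -ENAMETOOLONG (assumed error code)."""
    # Name lengths are counted in bytes, so encode before measuring.
    if len(name.encode()) > namelen:
        return -errno.ENAMETOOLONG
    return 0
```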
    • Miklos Szeredi's avatar
      ovl: split super.c · bbb1e54d
      Miklos Szeredi authored
      fs/overlayfs/super.c is the biggest of the overlayfs source files and it
      contains various utility functions as well as the rather complicated lookup
      code.  Split these parts out to separate files.
      
      Before:
      
       1446 fs/overlayfs/super.c
      
      After:
      
        919 fs/overlayfs/super.c
        267 fs/overlayfs/namei.c
        235 fs/overlayfs/util.c
         51 fs/overlayfs/ovl_entry.h
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      bbb1e54d
    • Miklos Szeredi's avatar
      ovl: use d_is_dir() · 2b8c30e9
      Miklos Szeredi authored
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      2b8c30e9
    • Miklos Szeredi's avatar
      ovl: simplify lookup · 8ee6059c
      Miklos Szeredi authored
      If encountering a non-directory, then stop looking at lower layers.
      
      In this case the oe->opaque flag is not set anymore, which doesn't matter
      since the existence of a lower file is now checked at remove/rename time.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      8ee6059c
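The rule in "ovl: simplify lookup" can be sketched as a loop over layers. This is an illustrative user-space model, not the kernel's lookup code: layers are represented as hypothetical dictionaries mapping a name to "dir" or "file", ordered from the upper layer downwards. Skipping a non-directory found below a directory is an assumption of this sketch.

```python
def lookup(layers, name):
    """Return [(layer_index, kind), ...] for the entries that get stacked."""
    found = []
    for index, layer in enumerate(layers):
        kind = layer.get(name)
        if kind is None:
            # Nothing with this name on this layer: try the next one.
            continue
        if kind != "dir":
            # Non-directory: stop looking at lower layers.
            # Assumption: it is only used if nothing was found above it.
            if not found:
                found.append((index, kind))
            break
        found.append((index, kind))
    return found
```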
    • Miklos Szeredi's avatar
      ovl: check lower existence of rename target · 3ee23ff1
      Miklos Szeredi authored
      Check if something exists on the lower layer(s) under the target of rename
      to decide if the directory needs to be marked "opaque".
      
      Marking opaque is done before the rename, and on failure the marking was
      undone.  Also the opaque xattr was removed if the target didn't cover
      anything.
      
      This patch changes behavior so that removal of "opaque" is not done in
      either of the above cases.  This means that a directory may have the opaque
      flag even if it doesn't cover anything.  However this shouldn't affect the
      performance or semantics of the overlay, while simplifying the code.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      3ee23ff1
    • Miklos Szeredi's avatar
      ovl: rename: simplify handling of lower/merged directory · 370e55ac
      Miklos Szeredi authored
      d_is_dir() is safe to call on a negative dentry.  Use this fact to simplify
      handling of the lower or merged directories.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      370e55ac
    • Miklos Szeredi's avatar
      ovl: get rid of PURE type · 38e813db
      Miklos Szeredi authored
      The remaining uses of __OVL_PATH_PURE can be replaced by
      ovl_dentry_is_opaque().
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      38e813db
    • Miklos Szeredi's avatar
      ovl: check lower existence when removing · 2aff4534
      Miklos Szeredi authored
      Currently ovl_lookup() checks existence of lower file even if there's a
      non-directory on upper (which is always opaque).  This is done so that
      remove can decide whether a whiteout is needed or not.
      
      It would be better to defer this check to unlink, since most of the time
      the gathered information about opaqueness will be unused.
      
      This adds a helper ovl_lower_positive() that checks if there's anything on
      the lower layer(s).
      
      The following patches also introduce changes to how the "opaque" attribute
      is updated on directories: this attribute is added when the directory is
      created or moved over a whiteout or object covering something on the lower
      layer.  However following changes will allow the attribute to remain on the
      directory after being moved, even if the new location doesn't cover
      anything.  Because of this, we need to check lower layers even for opaque
      directories, so that whiteout is only created when necessary.
      
      This function will later be also used to decide about marking a directory
      opaque, so deal with negative dentries as well.  When dealing with a
      negative dentry, it's enough to check whether it is a whiteout.
      
      If the dentry is positive but not upper then it also obviously needs
      whiteout/opaque.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      2aff4534
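The decision made by ovl_lower_positive(), as described in "ovl: check lower existence when removing", can be summarised in a short user-space sketch. This is not the kernel helper: the `Entry` class and its fields are hypothetical, and the `exists_on_lower` flag stands in for the actual lookups on the lower layer(s).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    positive: bool                 # the dentry has an inode
    is_whiteout: bool = False      # only meaningful for negative dentries
    has_upper: bool = False        # an upper layer object exists
    exists_on_lower: bool = False  # stand-in for the lower layer lookups


def lower_positive(entry):
    """Return True if a whiteout (or opaque marking) is needed."""
    if not entry.positive:
        # Negative dentry: it is enough to check for being a whiteout.
        return entry.is_whiteout
    if not entry.has_upper:
        # Positive but not upper: it can only come from a lower layer.
        return True
    # Upper object: check the lower layer(s), even for opaque directories.
    return entry.exists_on_lower
```

Removing an entry for which `lower_positive()` returns False needs no whiteout, which is how a whiteout is only created when necessary.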
    • Miklos Szeredi's avatar
      ovl: add ovl_dentry_is_whiteout() · c412ce49
      Miklos Szeredi authored
      And use it instead of ovl_dentry_is_opaque() where appropriate.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      c412ce49
    • Miklos Szeredi's avatar
      ovl: don't check sticky · 99f5d08e
      Miklos Szeredi authored
      Since commit 07a2daab ("ovl: Copy up underlying inode's ->i_mode to
      overlay inode") sticky checking on overlay inode is performed by the vfs,
      so checking against sticky on underlying inode is not needed.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      99f5d08e
    • Miklos Szeredi's avatar
      ovl: don't check rename to self · 804032fa
      Miklos Szeredi authored
      This is redundant: the vfs already performed this check (and was broken,
      see commit 9409e22a ("vfs: rename: check backing inode being equal")).
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      804032fa