1. 04 Sep, 2015 40 commits
    • ocfs2: fix race between crashed dio and rm · ad694821
      Joseph Qi authored
      There is a race between a crashed dio and rm, which leads to an inode
      with OCFS2_VALID_FL cleared being read and the filesystem being set read-only.
      
        N1                              N2
        ------------------------------------------------------------------------
        dd with direct flag
                                        rm file
        crashed with a dio entry left
        in orphan dir
                                        clear OCFS2_VALID_FL in
                                        ocfs2_remove_inode
                                        recover N1 and read the corrupted inode,
                                        and set filesystem read-only
      
      So we skip the inode deletion this time and wait for the dio entry to be
      recovered first.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: avoid access invalid address when read o2dlm debug messages · f57a22dd
      Yiwen Jiang authored
      The following case leads to a lockres being freed while it is still in use.
      
      cat /sys/kernel/debug/o2dlm/locking_state	dlm_thread
      lockres_seq_start
          -> lock dlm->track_lock
          -> get resA
                                                      resA->refs decrease to 0,
                                                      call dlm_lockres_release,
                                                      and wait for "cat" unlock.
      Although resA->refs is already set to 0,
      increase resA->refs, and then unlock
                                                      lock dlm->track_lock
                                                          -> list_del_init()
                                                          -> unlock
                                                          -> free resA
      
      In such a race, an invalid address access may occur.  So we should
      remove the lockres from the res->tracking list before resA->refs decreases to 0.
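
      The ordering described above can be sketched in user-space C.  This is
      only an illustration, not the actual patch: struct res, tracking_del()
      and res_put() are hypothetical names, and the lock that protects the
      tracking list is assumed to be held by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the o2dlm structures. */
struct res {
    int refs;                 /* reference count */
    struct res *prev, *next;  /* links in the tracking list */
    bool freed;               /* set where real code would free the memory */
};

/* The tracking list is circular with a sentinel head. */
static void tracking_init(struct res *head)
{
    head->prev = head->next = head;
}

static void tracking_add(struct res *head, struct res *r)
{
    r->prev = head->prev;
    r->next = head;
    head->prev->next = r;
    head->prev = r;
}

static void tracking_del(struct res *r)
{
    r->prev->next = r->next;
    r->next->prev = r->prev;
    r->prev = r->next = r;
}

static bool tracking_contains(const struct res *head, const struct res *r)
{
    for (const struct res *p = head->next; p != head; p = p->next)
        if (p == r)
            return true;
    return false;
}

/*
 * Drop one reference.  The caller is assumed to hold the lock that
 * protects the tracking list, so a list walker can never find a
 * resource whose count has already reached zero.
 */
static void res_put(struct res *r)
{
    if (r->refs == 1)
        tracking_del(r);      /* unlink first ... */
    if (--r->refs == 0)
        r->freed = true;      /* ... then release */
}
```

      With this ordering, a reader that walks the tracking list under the
      same lock either finds the resource with a non-zero count or does not
      find it at all.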
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: take inode lock in ocfs2_iop_set/get_acl() · 743b5f14
      Tariq Saeed authored
      This bug in mainline code was pointed out by Mark Fasheh.  When
      ocfs2_iop_set_acl() and ocfs2_iop_get_acl() are entered from the VFS layer,
      the inode lock is not held.  This seems to be a regression from older kernels.
      This patch fixes that.
      
      Orabug: 20189959
      Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed() · 3d46a44a
      Tariq Saeed authored
      PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
        #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
        #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
        #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
        #3 [ffff882ecc375b20] die at ffffffff8101868b
        #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
        #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
        #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
           [exception RIP: ocfs2_ci_checkpointed+208]
           RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
           RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
           RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
           RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
           R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
           R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
           ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
        #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
        #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
      #10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
      #11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
      #12 [ffff882ecc375ee8] kthread at ffffffff81090da7
      #13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
      The assert is tripped because the transaction is not checkpointed and the lock level is PR.
      
      Some time ago, a chmod command had been executed. As a result, the following call
      chain left the inode cluster lock in PR state, later on causing the assert.
      system_call_fastpath
        -> my_chmod
         -> sys_chmod
          -> sys_fchmodat
           -> notify_change
            -> ocfs2_setattr
             -> posix_acl_chmod
              -> ocfs2_iop_set_acl
               -> ocfs2_set_acl
                -> ocfs2_acl_set_mode
      Here is how.
      1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
      1120 {
      1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
      ..
      1258         if (!status && attr->ia_valid & ATTR_MODE) {
      1259                 status =  posix_acl_chmod(inode, inode->i_mode);
      
      519 posix_acl_chmod(struct inode *inode, umode_t mode)
      520 {
      ..
      539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);
      
      287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
      288 {
      289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);
      
      224 int ocfs2_set_acl(handle_t *handle,
      225                          struct inode *inode, ...
      231 {
      ..
      252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
      253                                                          handle, mode);
      
      168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
      170 {
      183         if (handle == NULL) {
                          >>> BUG: inode lock not held in ex at this point <<<
      184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
      185                                            OCFS2_INODE_UPDATE_CREDITS);
      
      At ocfs2_setattr.#1247 we unlock and at #1259 we call posix_acl_chmod. When we reach
      ocfs2_acl_set_mode.#184 and start the transaction, the inode cluster lock is not held in EX
      mode (it should be). How could this have happened?
      
      We are the lock master, were holding the lock in EX and have released it in
      ocfs2_setattr.#1247.  Note that there are no holders of this lock at
      this point.  Another node needs the lock in PR, and we downconvert from
      EX to PR.  So the inode lock is PR when we start the transaction in
      ocfs2_acl_set_mode.#184.  The transaction stays in core (not flushed to disk).
      Now another node wants the lock in EX; the downconvert thread gets kicked
      (the one that tripped the assert above) and finds an unflushed transaction, but the
      lock is not EX (it is PR).  If the lock were at EX, it would have flushed
      the transaction (ocfs2_ci_checkpointed -> ocfs2_start_checkpoint) before
      downconverting (to NULL) for the request.
      
      ocfs2_setattr must not drop the inode lock EX in this code path.  If it
      does, and takes it again before the transaction, say in ocfs2_set_acl, another
      cluster node can get in between, execute another setattr, and overwrite
      the one in progress on this node, resulting in a mode/ACL/size combination
      that is a mix of the two.
      
      Orabug: 20189959
      Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: optimize error handling in dlm_request_join · 72f6fe1f
      Norton.Zhu authored
      Currently error handling in dlm_request_join is a little obscure, so
      optimize it to promote readability.
      
      If packet.code is invalid, reset it to JOIN_DISALLOW to keep it
      meaningful.  It only influences the log printing.
      Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
      Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a tiny case that inode can not removed · 928dda1f
      Yiwen Jiang authored
      When running dirop_fileop_racer we found a case where an inode
      cannot be removed.
      
      Two nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
      two dirs /race/1/ and /race/2/ in the filesystem.
      
        Node A                            Node B
        rm -r /race/2/
                                          mv /race/1/ /race/2/
        call ocfs2_unlink(), get
        the EX mode of /race/2/
                                          wait for A to unlock /race/2/
        decrease i_nlink of /race/2/ to 0,
        and add inode of /race/2/ into
        orphan dir, unlock /race/2/
                                          got EX mode of /race/2/. because
                                          /race/1/ is dir, so inc i_nlink
                                          of /race/2/ and update into disk,
                                          unlock /race/2/
        because i_nlink of /race/2/
        is not zero, this inode will
        always remain in orphan dir
      
      This patch fixes this case by testing whether i_nlink of the new dir is zero.
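
      A minimal sketch of such a check, in user-space C.  The structure, the
      function name and the -ENOENT return code are assumptions made for
      illustration; they are not taken from the patch.

```c
#include <errno.h>

/* Hypothetical stand-in for the directory inode that receives the entry. */
struct dir_inode {
    unsigned int nlink;   /* 0 means the directory was already unlinked */
};

/*
 * Account for a subdirectory being moved into new_dir.  If new_dir has
 * already been unlinked on another node (nlink == 0), refuse instead of
 * raising the link count of an inode that sits in the orphan dir.
 */
static int link_subdir_into(struct dir_inode *new_dir)
{
    if (new_dir->nlink == 0)
        return -ENOENT;   /* assumed error code, for illustration only */
    new_dir->nlink++;
    return 0;
}
```

      With the check in place, the link count of an orphaned directory stays
      at zero, so orphan recovery can still remove the inode.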
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add ip_alloc_sem in direct IO to protect allocation changes · 6ab855a9
      WeiWei Wang authored
      In ocfs2, ip_alloc_sem is used to protect allocation changes on the
      node.  In direct IO, we take ip_alloc_sem to protect data consistency
      when direct IO races with ocfs2_truncate_file (buffered IO uses
      ip_alloc_sem already).  Although the inode->i_mutex lock is used to avoid
      concurrency in the above situation, I think ip_alloc_sem is still needed
      because protecting allocation changes is significant.
      
      Other filesystems such as ext4 also use an rw_semaphore, by other means,
      to protect data consistency in the get_block-vs-truncate race, so
      ip_alloc_sem in ocfs2 direct IO is needed.
      Signed-off-by: Weiwei Wang <wangww631@huawei.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clear the rest of the buffers on error · 34237681
      Goldwyn Rodrigues authored
      In case a validation fails, clear the rest of the buffers and return the
      error to the calling function.
      
      This also facilitates bubbling up the error originating from ocfs2_error
      to calling functions.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: acknowledge return value of ocfs2_error() · 17a5b9ab
      Goldwyn Rodrigues authored
      Caveat: This may return -EROFS for a read case, which seems wrong.  This
      is happening even without this patch series though.  Should we convert
      EROFS to EIO?
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add errors=continue · 7d0fb914
      Goldwyn Rodrigues authored
      OCFS2 is often used in high-availability systems.  However, ocfs2
      converts the filesystem to read-only at the drop of a hat.  This may
      not be necessary, since turning the filesystem read-only would affect
      other running processes as well, decreasing availability.
      
      This attempt is to add errors=continue, which would return the EIO to
      the calling process and terminate further processing so that the
      filesystem is not corrupted further.  However, the filesystem is not
      converted to read-only.
      
      As a future plan, I intend to create a small utility or extend
      fsck.ocfs2 to fix small errors such as in the inode.  The input to the
      utility such as the inode can come from the kernel logs so we don't have
      to schedule a downtime for fixing small-enough errors.
      
      The patch changes the ocfs2_error to return an error.  The error
      returned depends on the mount option set.  If none is set, the default
      is to turn the filesystem read-only.
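
      The behaviour described above can be sketched as follows.  This is a
      user-space illustration under stated assumptions, not the kernel code:
      the enum, the structure and handle_fs_error() are hypothetical names.
      Only the two outcomes (EIO without going read-only, or read-only by
      default) come from the description.

```c
#include <errno.h>

/* Hypothetical representation of the errors= mount option. */
enum errors_mode {
    ERRORS_REMOUNT_RO,   /* default: turn the filesystem read-only */
    ERRORS_CONTINUE      /* errors=continue: report EIO and keep going */
};

struct fs_state {
    enum errors_mode mode;
    int read_only;       /* non-zero once the filesystem is read-only */
};

/*
 * Handle a detected on-disk error and return the code that the caller
 * should propagate to the calling process.
 */
static int handle_fs_error(struct fs_state *fs)
{
    if (fs->mode == ERRORS_CONTINUE)
        return -EIO;     /* filesystem stays writable */

    fs->read_only = 1;   /* default policy */
    return -EROFS;
}
```

      Callers then propagate the returned value instead of ignoring it, so
      the calling process sees the failure in either mode.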
      
      Perhaps errors=continue is not the best option name.  Historically it is
      used for making an attempt to progress in the current process itself.
      Should we call it errors=eio? or errors=killproc? Suggestions/Comments
      welcome.
      
      Sources are available at:
        https://github.com/goldwynr/linux/tree/error-cont

      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: flush inode data to disk and free inode when i_count becomes zero · 513e2dae
      Xue jiufei authored
      Disk inode deletion may be heavily delayed when one node unlinks a file
      after the same dentry has been freed on another node (say N1) because of
      memory shrinking, while the inode is left in memory.  This inode can only
      be freed when N1 does the orphan scan work.
      
      However, N1 may skip the orphan scan several times because other nodes
      may do the work earlier.  In our tests, it may take 1 hour on a 4-node
      cluster, which hurts the user experience.  So we think the inode should
      be freed after the data is flushed to disk when i_count becomes zero, to
      avoid such circumstances.
      Signed-off-by: Joyce.xue <xuejiufei@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: trusted xattr missing CAP_SYS_ADMIN check · 0f5e7b41
      Sanidhya Kashyap authored
      The trusted extended attributes are only visible to processes which
      have the CAP_SYS_ADMIN capability, but the check is missing in the ocfs2
      xattr_handler trusted list.  The check is important because these
      attributes will be used for implementing mechanisms in userspace to which
      ordinary processes should not have access.
      Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Taesoo kim <taesoo@gatech.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: set filesytem read-only when ocfs2_delete_entry failed. · 807a7907
      jiangyiwen authored
      In ocfs2_rename, a failure of ocfs2_delete_entry(old) leaves an inode with
      two entries (old and new).  Thus, the filesystem will be inconsistent.
      
      The case is described below:
      
      ocfs2_rename
          -> ocfs2_start_trans
          -> ocfs2_add_entry(new)
          -> ocfs2_delete_entry(old)
              -> __ocfs2_journal_access *failed* because of -ENOMEM
          -> ocfs2_commit_trans
      
      So the filesystem should be set to read-only at that point.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: use list_for_each_entry instead of list_for_each · f83c7b5e
      Joseph Qi authored
      Use list_for_each_entry instead of list_for_each to simplify code.
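
      To show what the simplification looks like, here is a user-space
      sketch.  The macros are simplified re-implementations written for this
      example (the kernel versions live in include/linux/list.h), struct item
      and the two sum functions are hypothetical, and __typeof__ is a
      GCC/Clang extension.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Simplified user-space versions of the kernel list helpers. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

#define list_for_each_entry(pos, head, member)                          \
    for (pos = list_entry((head)->next, __typeof__(*pos), member);      \
         &pos->member != (head);                                        \
         pos = list_entry(pos->member.next, __typeof__(*pos), member))

/* Hypothetical list element used only for this example. */
struct item {
    int value;
    struct list_head link;
};

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Before: iterate over list_head pointers and convert each one by hand. */
static int sum_with_list_for_each(struct list_head *head)
{
    struct list_head *iter;
    int sum = 0;

    list_for_each(iter, head) {
        struct item *it = list_entry(iter, struct item, link);
        sum += it->value;
    }
    return sum;
}

/* After: the macro yields the containing structure directly. */
static int sum_with_list_for_each_entry(struct list_head *head)
{
    struct item *it;
    int sum = 0;

    list_for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

      Both loops visit the same elements; the second form drops the extra
      iterator variable and the explicit list_entry() call.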
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove unneeded code in dlm_register_domain_handlers · 0e3d9eaf
      Joseph Qi authored
      The last goto statement is unneeded, so remove it.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix BUG when o2hb_register_callback fails · cdd09f49
      Joseph Qi authored
      In dlm_register_domain_handlers, if o2hb_register_callback fails, it
      will call dlm_unregister_domain_handlers to unregister.  This will
      trigger the BUG_ON in o2hb_unregister_callback because hc_magic is 0.
      So we should call o2hb_setup_callback to initialize hc first.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove unneeded code in ocfs2_dlm_init · 914a9b74
      Joseph Qi authored
      status is already initialized and it will only be 0 or negative in the
      code flow.  So remove the unneeded assignment after the label 'local'.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: adjust code to match locking/unlocking order · 3cb2ec43
      Joseph Qi authored
      The unlocking order in ocfs2_unlink and ocfs2_rename does not match the
      corresponding locking order.  Although it won't cause issues, adjust the
      code so that it looks more reasonable.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean up unused local variables in ocfs2_file_write_iter · bf59e662
      Joseph Qi authored
      Since commit 86b9c6f3 ("ocfs2: remove filesize checks for sync I/O
      journal commit") removes filesize checks for sync I/O journal commit,
      variables old_size and old_clusters are not actually used any more.  So
      clean them up.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: do not log twice error messages · 372a447c
      Christophe JAILLET authored
      'o2hb_map_slot_data' and 'o2hb_populate_slot_data' are called from only
      one place, in 'o2hb_region_dev_write'.  Return value is checked and
      'mlog_errno' is called to log a message if it is not 0.
      
      So there is no need to call 'mlog_errno' directly within these functions.
      This would result in logging the message twice.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access · acf8fdbe
      Joseph Qi authored
      When the storage network is unstable, it may trigger the BUG in
      __ocfs2_journal_access because the buffer is not uptodate.  We can retry the
      write in this case or return an error instead of BUG.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Zhangguanghui <zhang.guanghui@h3c.com>
      Tested-by: Zhangguanghui <zhang.guanghui@h3c.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix several issues of append dio · faaebf18
      Joseph Qi authored
      1) Take rw EX lock in case of append dio.
      2) Explicitly treat the error code -EIOCBQUEUED as normal.
      3) Set di_bh to NULL after brelse if it may be used again later.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Weiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix race between dio and recover orphan · 512f62ac
      Joseph Qi authored
      During direct io the inode is first added to the orphan dir and then
      deleted from it.  There is a race window in which the orphan entry is
      deleted twice, triggering the BUG when validating
      OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.
      
      ocfs2_direct_IO_write
          ...
          ocfs2_add_inode_to_orphan
          >>>>>>>> race window.
                   1) another node may rm the file and then go down; this node
                   takes care of orphan recovery and clears the flag
                   OCFS2_DIO_ORPHANED_FL.
                   2) since rw lock is unlocked, it may race with another
                   orphan recovery and append dio.
          ocfs2_del_inode_from_orphan
      
      So take the inode mutex when recovering orphans and do the rw unlock at the
      end of the aio write in case of append dio.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Weiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sh: use PFN_DOWN macro · 81cf09ed
      Alexander Kuleshov authored
      Replace ((x) >> PAGE_SHIFT) with the predefined PFN_DOWN macro.
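
      As a small user-space illustration of the replacement: the PAGE_SHIFT
      value of 12 (4 KiB pages) is an assumption for this example, the macro
      is redefined locally with the same shape as in the description, and the
      two helper functions are hypothetical.

```c
/* Assumption for this example: 4 KiB pages. */
#define PAGE_SHIFT 12

/* Same shape as the expression being replaced. */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Before: open-coded shift. */
static unsigned long pfn_open_coded(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* After: the macro states the intent (round down to a page frame number). */
static unsigned long pfn_with_macro(unsigned long addr)
{
    return PFN_DOWN(addr);
}
```

      Both functions compute the same value; the macro only makes the intent
      explicit.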
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ntfs: delete unnecessary checks before calling iput() · 917520e1
      SF Markus Elfring authored
      iput() tests whether its argument is NULL and then returns immediately.
      Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
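
      The pattern can be sketched in user-space C.  struct obj and obj_put()
      are hypothetical stand-ins for an inode and iput(); the only property
      taken from the description is that the release function returns
      immediately when given NULL.

```c
#include <stddef.h>

/* Hypothetical reference-counted object standing in for an inode. */
struct obj {
    int refs;
};

/* Like iput() as described above: a NULL argument is a no-op. */
static void obj_put(struct obj *o)
{
    if (!o)
        return;
    o->refs--;
}

/* Before: the caller repeats a check that obj_put() already performs. */
static void caller_before(struct obj *o)
{
    if (o)
        obj_put(o);
}

/* After: the redundant test is dropped. */
static void caller_after(struct obj *o)
{
    obj_put(o);
}
```

      Both callers behave identically for NULL and non-NULL arguments.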
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/spelling.txt: add some typo-words · 35108d71
      Zhao Lei authored
      I wrote a small script to show word-pairs from all Linux spelling-typo
      commits, and got the following result with sort | uniq -c:
      
          181 occured -> occurred
           78 transfered -> transferred
           67 recieved -> received
           65 dependant -> dependent
           58 wether -> whether
           56 accomodate -> accommodate
           54 occured -> occurred
           51 recieve -> receive
           47 cant -> can't
           40 sucessfully -> successfully
           ...
      
      Some of them are not in spelling.txt.  This patch adds the most common
      word-pairs to spelling.txt.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts: decode_stacktrace: fix ARM architecture decoding · e260fe01
      Robert Jarzmik authored
      Fix the stack decoder for the ARM architecture.
      An ARM stack trace looks like:
      
      [   81.547704] [<c023eb04>] (bucket_find_contain) from [<c023ec88>] (check_sync+0x40/0x4f8)
      [   81.559668] [<c023ec88>] (check_sync) from [<c023f8c4>] (debug_dma_sync_sg_for_cpu+0x128/0x194)
      [   81.571583] [<c023f8c4>] (debug_dma_sync_sg_for_cpu) from [<c0327dec>] (__videobuf_s
      
      The current script doesn't expect the symbols to be enclosed in
      parentheses, and triggers the following errors:
      
        awk: cmd. line:1: error: Unmatched ( or \(: / (check_sync$/
        [   81.547704] (bucket_find_contain) from (check_sync+0x40/0x4f8)
      
      Fix it by chopping the leading and trailing parentheses from each symbol
      name.
      
      As a side note, this probably comes from the function
      dump_backtrace_entry(), which is implemented differently for each
      architecture.  That makes a single decoding script a bit of a challenge.
      Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/Lindent: handle missing indent gracefully · fa70900e
      Jean Delvare authored
      If indent is not found, bail out immediately instead of spitting random
      shell script error messages.
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kerneldoc: Convert error messages to GNU error message format · d40e1e65
      Bart Van Assche authored
      Editors like emacs and vi recognize a number of error message formats.
      The format used by the kerneldoc tool is not recognized by emacs.
      
      Change the kerneldoc error message format to the GNU style such that the
      emacs prev-error and next-error commands can be used to navigate through
      kerneldoc error messages.  For more information about the GNU error
      message format, see also
        https://www.gnu.org/prep/standards/html_node/Errors.html.
      
      This patch has been generated via the following sed command:
      
        sed -i.orig 's/Error(\${file}:\$.):/\${file}:\$.: error:/g;s/Warning(\${file}:\$.):/\${file}:\$.: warning:/g;s/Warning(\${file}):/\${file}:1: warning:/g;s/Info(\${file}:\$.):/\${file}:\$.: info:/g' scripts/kernel-doc
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/spelling.txt: spelling of uninitialized · c22b6ae6
      Sudip Mukherjee authored
      I just misspelled uninitialized and wrote it as
      unintialized.  Fortunately I noticed it in my final review.
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • scripts/spelling.txt: add misspelled words for check · 779a6ce8
      Maninder Singh authored
      misspelled words for check:-
       chcek
       chck
       cehck
      
      I made these spelling mistakes myself in patch changelogs, so I suggest
      adding them to spelling.txt so that checkpatch.pl warns about them
      earlier.  References:
      
      ./arch/powerpc/kernel/exceptions-64e.S:456: . . . make sure you chcek
      https://lkml.org/lkml/2015/6/25/289
      ./arch/x86/mm/pageattr.c:1368: * No need to cehck in that case
      
      [akpm@linux-foundation.org: add whcih->which, whcih I always get wrong]
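      A minimal sketch of how such a word list can be used to flag typos,
      assuming entries of the form "misspelling||correction", one per line.
      The helper names and the SAMPLE text are hypothetical; this is not how
      checkpatch.pl itself is implemented.

```python
import re


def load_spelling(text: str) -> dict:
    """Parse 'misspelling||correction' lines into a lookup table."""
    fixes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        wrong, sep, right = line.partition("||")
        if sep:
            fixes[wrong.lower()] = right
    return fixes


def find_misspellings(fixes: dict, message: str) -> list:
    """Return (word, suggestion) pairs for every known misspelling."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", message):
        suggestion = fixes.get(word.lower())
        if suggestion is not None:
            hits.append((word, suggestion))
    return hits


# Hypothetical sample built from the words named in this commit.
SAMPLE = "chcek||check\nchck||check\ncehck||check\nwhcih||which\n"

if __name__ == "__main__":
    table = load_spelling(SAMPLE)
    print(find_misspellings(table, "No need to cehck in that case"))
```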
      Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      779a6ce8
    • Jan Kara's avatar
      fsnotify: get rid of fsnotify_destroy_mark_locked() · 4712e722
      Jan Kara authored
      fsnotify_destroy_mark_locked() is subtle to use because it temporarily
      releases group->mark_mutex.  To avoid future problems with this
      function, split it into two.
      
      fsnotify_detach_mark() is the part that needs group->mark_mutex and
      fsnotify_free_mark() is the part that must be called outside of
      group->mark_mutex.  This way it's much clearer what's going on and we
      also avoid some pointless acquisitions of group->mark_mutex.
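      The split can be pictured with a small userspace analogy in Python.
      This is only a sketch of the locking pattern: the Group and Mark
      classes and the destroy_mark() helper are hypothetical stand-ins, and
      threading.Lock plays the role of group->mark_mutex.

```python
import threading


class Group:
    def __init__(self):
        self.mark_mutex = threading.Lock()  # stands in for group->mark_mutex
        self.marks = []


class Mark:
    def __init__(self, group):
        self.group = group
        self.attached = False
        self.freed = False


def attach_mark(mark):
    with mark.group.mark_mutex:
        mark.group.marks.append(mark)
        mark.attached = True


def detach_mark(mark):
    """Part that needs the mutex: the caller must already hold it."""
    assert mark.group.mark_mutex.locked(), "caller must hold mark_mutex"
    if mark.attached:
        mark.group.marks.remove(mark)
        mark.attached = False


def free_mark(mark):
    """Part that must run outside the mutex: final teardown."""
    mark.freed = True


def destroy_mark(mark):
    """Caller-side pattern: detach under the lock, free after dropping it."""
    with mark.group.mark_mutex:
        detach_mark(mark)
    free_mark(mark)
```

      Because neither helper drops and retakes the lock internally, a caller
      that already holds the mutex can detach several marks in one critical
      section and free them afterwards.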
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4712e722
    • Jan Kara's avatar
      fsnotify: remove mark->free_list · 925d1132
      Jan Kara authored
      The free list is used when all marks on a given inode / mount must be
      destroyed because the inode / mount is going away.  However, with some
      care we can free all of the marks without using a special list.
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      925d1132
    • Jan Kara's avatar
      fsnotify: document mark locking · 1e39fc01
      Jan Kara authored
      Signed-off-by: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e39fc01
    • Jan Kara's avatar
      fsnotify: fix check in inotify fdinfo printing · 3c53e514
      Jan Kara authored
      A check in inotify_fdinfo() checking whether mark is valid was always
      true due to a bug.  Luckily we can never get to invalidated marks since
      we hold mark_mutex and invalidated marks get removed from the group list
      when they are invalidated under that mutex.
      
      Fix the check anyway to make the code more future-proof.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c53e514
    • Dave Hansen's avatar
      fs/notify: optimize inotify/fsnotify code for unwatched files · 7c49b861
      Dave Hansen authored
      I have a _tiny_ microbenchmark that sits in a loop and writes single
      bytes to a file.  Writing one byte to a tmpfs file is around 2x slower
      than reading one byte from a file, which is a _bit_ more than I expected.
      This is a dumb benchmark, but I think it's hard to deny that write() is
      a hot path and we should avoid unnecessary overhead there.
      
      I did a 'perf record' of 30-second samples of read and write.  The top
      item in a diffprofile is srcu_read_lock() from fsnotify().  There are
      active inotify fd's from systemd, but nothing is actually listening to
      the file or its part of the filesystem.
      
      I *think* we can avoid taking the srcu_read_lock() for the common case
      where there are no actual marks on the file.  In that case there is
      nothing to notify *and* no need to clear the ignore mask.
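      The idea of the fast path can be sketched in Python as follows.  This
      is a simplified model, not the kernel code: marks are modelled as plain
      event bitmasks, and the notify() function and the stats counters are
      hypothetical names used only to show which path was taken.

```python
def notify(inode_marks, mount_marks, event, stats):
    """Deliver 'event' to interested marks, skipping the lock if unwatched.

    inode_marks / mount_marks: lists of event bitmasks (one per mark).
    stats: dict counting how often each path is taken.
    Returns the number of marks the event was delivered to.
    """
    # Fast path: with no marks at all there is nothing to notify and no
    # ignore mask to clear, so the expensive read lock is never taken.
    if not inode_marks and not mount_marks:
        stats["fast_path"] += 1
        return 0

    # Slow path: stands in for the srcu_read_lock()-protected walk.
    stats["srcu_read_lock"] += 1
    delivered = 0
    for mask in list(inode_marks) + list(mount_marks):
        if mask & event:
            delivered += 1
    return delivered


if __name__ == "__main__":
    stats = {"fast_path": 0, "srcu_read_lock": 0}
    FS_MODIFY = 0x2  # illustrative event bit
    notify([], [], FS_MODIFY, stats)           # unwatched file
    notify([FS_MODIFY], [], FS_MODIFY, stats)  # watched file
    print(stats)
```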
      
      This patch gave a 13.1% speedup in writes/second on my test, which is an
      improvement from the 10.8% that I saw with the last version.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c49b861
    • Yuriy Kolerov's avatar
      drivers/video/concole: add negative dependency for VGA_CONSOLE on ARC · 031e29b5
      Yuriy Kolerov authored
      Architectures which support the VGA console must define the screen_info
      structure from "uapi/linux/screen_info.h".  Otherwise an undefined
      symbol error occurs.  Usually it's defined in "setup.c" for each
      architecture.
      
      If an architecture does not support the VGA console (ARC's case) there
      are two options: define a dummy instance of screen_info, or add a
      negative dependency to VGA_CONSOLE to prevent this option from being
      selected.
      
      I've implemented the second way.  However the best solution is to add
      HAVE_VGA_CONSOLE option for targets which support VGA console.  Then
      turn off VGA_CONSOLE by default and add dependency to HAVE_VGA_CONSOLE.
      But right now it's better to just add a negative dependency for ARC and
      then consider how to collaborate about this issue with maintainers of
      other architectures.
      Signed-off-by: Yuriy Kolerov <yuriy.kolerov@synopsys.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jaya Kumar <jayalk@intworks.biz>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      031e29b5
    • Andy Lutomirski's avatar
      capabilities: add a securebit to disable PR_CAP_AMBIENT_RAISE · 746bf6d6
      Andy Lutomirski authored
      Per Andrew Morgan's request, add a securebit to allow admins to disable
      PR_CAP_AMBIENT_RAISE.  This securebit will prevent processes from adding
      capabilities to their ambient set.
      
      For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than
      just disabling setting previously cleared bits.
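      The effect can be modelled with plain bitmasks in Python.  This is a
      sketch under stated assumptions, not the kernel implementation: pP, pI
      and pA are the task's permitted, inheritable and ambient masks, the
      ambient_raise() function is hypothetical, and the securebit name and
      bit value below are used only as labels inside this model.

```python
import errno

# Label for the new securebit inside this model (illustrative value).
SECBIT_NO_CAP_AMBIENT_RAISE = 1 << 6


def ambient_raise(pP, pI, pA, cap_bit, securebits):
    """Model of PR_CAP_AMBIENT_RAISE: return the new pA or raise EPERM."""
    # With the securebit set, raising is disabled entirely, even for a
    # bit that is already present in pA.
    if securebits & SECBIT_NO_CAP_AMBIENT_RAISE:
        raise PermissionError(errno.EPERM, "ambient raise disabled by securebit")
    # A bit may only enter pA if it is in both pP and pI.
    if not (pP & cap_bit and pI & cap_bit):
        raise PermissionError(errno.EPERM, "capability not in both pP and pI")
    return pA | cap_bit


if __name__ == "__main__":
    cap = 1 << 10  # illustrative capability bit
    print(hex(ambient_raise(cap, cap, 0, cap, 0)))
    try:
        ambient_raise(cap, cap, cap, cap, SECBIT_NO_CAP_AMBIENT_RAISE)
    except PermissionError as exc:
        print("refused:", exc.strerror)
```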
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Andrew G. Morgan <morgan@kernel.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Aaron Jones <aaronmdjones@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: Markku Savela <msa@moth.iki.fi>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      746bf6d6
    • Andy Lutomirski's avatar
      selftests/capabilities: Add tests for capability evolution · 32ae976e
      Andy Lutomirski authored
      This test focuses on ambient capabilities.  It requires either root or
      the ability to create user namespaces.  Some of the test cases will be
      skipped for nonroot users.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com> # Original author
      Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32ae976e
    • Andy Lutomirski's avatar
      capabilities: ambient capabilities · 58319057
      Andy Lutomirski authored
      Credit where credit is due: this idea comes from Christoph Lameter with
      a lot of valuable input from Serge Hallyn.  This patch is heavily based
      on Christoph's patch.
      
      ===== The status quo =====
      
      On Linux, there are a number of capabilities defined by the kernel.  To
      perform various privileged tasks, processes can wield capabilities that
      they hold.
      
      Each task has four capability masks: effective (pE), permitted (pP),
      inheritable (pI), and a bounding set (X).  When the kernel checks for a
      capability, it checks pE.  The other capability masks serve to modify
      what capabilities can be in pE.
      
      Any task can remove capabilities from pE, pP, or pI at any time.  If a
      task has a capability in pP, it can add that capability to pE and/or pI.
      If a task has CAP_SETPCAP, then it can add any capability to pI, and it
      can remove capabilities from X.
      
      Tasks are not the only things that can have capabilities; files can also
      have capabilities.  A file can have no capability information at all [1].
      If a file has capability information, then it has a permitted mask (fP)
      and an inheritable mask (fI) as well as a single effective bit (fE) [2].
      File capabilities modify the capabilities of tasks that execve(2) them.
      
      A task that successfully calls execve has its capabilities modified for
      the file ultimately being executed (i.e. the binary itself if that
      binary is ELF, or the interpreter if the binary is a script) [3].  In
      the capability evolution rules, for each mask Z, pZ represents the old
      value and pZ' represents the new value.  The rules are:
      
        pP' = (X & fP) | (pI & fI)
        pI' = pI
        pE' = (fE ? pP' : 0)
        X is unchanged
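      These rules translate directly into bitmask arithmetic.  The Python
      sketch below is a model for experimenting with them, not kernel code;
      the function name execve_caps_old and the example bit are hypothetical.

```python
def execve_caps_old(pI, X, fP, fI, fE):
    """Apply the pre-patch evolution rules; return (pP', pI', pE').

    All masks are integers used as bit sets; fE is a single boolean.
    The old pP and pE do not appear in the rules, so they are not inputs.
    """
    new_pP = (X & fP) | (pI & fI)
    new_pI = pI
    new_pE = new_pP if fE else 0
    return new_pP, new_pI, new_pE


if __name__ == "__main__":
    cap = 1 << 10        # illustrative capability bit
    full = (1 << 64) - 1  # stand-in for "the full set"

    # Nonroot user running an ordinary binary: fP and fI are empty and fE
    # is false, so holding 'cap' in pI gains the new program nothing.
    print(execve_caps_old(pI=cap, X=full, fP=0, fI=0, fE=False))

    # A file with 'cap' in fI and fE set lets the inheritable bit through.
    print(execve_caps_old(pI=cap, X=full, fP=0, fI=cap, fE=True))
```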
      
      For setuid binaries, fP, fI, and fE are modified by a moderately
      complicated set of rules that emulate POSIX behavior.  Similarly, if
      euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
      (primarily, fP and fI usually end up being the full set).  For nonroot
      users executing binaries with neither setuid nor file caps, fI and fP
      are empty and fE is false.
      
      As an extra complication, if you execute a process as nonroot and fE is
      set, then the "secure exec" rules are in effect: AT_SECURE gets set,
      LD_PRELOAD doesn't work, etc.
      
      This is rather messy.  We've learned that making any changes is
      dangerous, though: if a new kernel version allows an unprivileged
      program to change its security state in a way that persists across
      execution of a setuid program or a program with file caps, this
      persistent state is surprisingly likely to allow setuid or file-capped
      programs to be exploited for privilege escalation.
      
      ===== The problem =====
      
      Capability inheritance is basically useless.
      
      If you aren't root and you execute an ordinary binary, fI is zero, so
      your capabilities have no effect whatsoever on pP'.  This means that you
      can't usefully execute a helper process or a shell command with elevated
      capabilities if you aren't root.
      
      On current kernels, you can sort of work around this by setting fI to
      the full set for most or all non-setuid executable files.  This causes
      pP' = pI for nonroot, and inheritance works.  No one does this because
      it's a PITA and it isn't even supported on most filesystems.
      
      If you try this, you'll discover that every nonroot program ends up with
      secure exec rules, breaking many things.
      
      This is a problem that has bitten many people who have tried to use
      capabilities for anything useful.
      
      ===== The proposed change =====
      
      This patch adds a fifth capability mask called the ambient mask (pA).
      pA does what most people expect pI to do.
      
      pA obeys the invariant that no bit can ever be set in pA if it is not
      set in both pP and pI.  Dropping a bit from pP or pI drops that bit from
      pA.  This ensures that existing programs that try to drop capabilities
      still do so, with a complication.  Because capability inheritance is so
      broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
      then calling execve effectively drops capabilities.  Therefore,
      setresuid from root to nonroot conditionally clears pA unless
      SECBIT_NO_SETUID_FIXUP is set.  Processes that don't like this can
      re-add bits to pA afterwards.
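      The invariant on pA can be written as a one-line mask operation.  The
      following Python sketch only models the "dropping a bit from pP or pI
      drops it from pA" part; the drop_caps() helper is hypothetical and the
      setresuid handling is deliberately left out.

```python
def drop_caps(pP, pI, pA, drop_from_pP=0, drop_from_pI=0):
    """Remove bits from pP and/or pI and re-establish the pA invariant.

    Invariant modelled here: pA is always a subset of (pP & pI).
    Returns the new (pP, pI, pA).
    """
    pP &= ~drop_from_pP
    pI &= ~drop_from_pI
    pA &= pP & pI  # any bit missing from pP or pI also leaves pA
    return pP, pI, pA


if __name__ == "__main__":
    cap_a, cap_b = 1 << 10, 1 << 13  # illustrative capability bits
    both = cap_a | cap_b
    # Dropping cap_a from the inheritable set removes it from pA as well.
    print(drop_caps(both, both, both, drop_from_pI=cap_a))
```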
      
      The capability evolution rules are changed:
      
        pA' = (file caps or setuid or setgid ? 0 : pA)
        pP' = (X & fP) | (pI & fI) | pA'
        pI' = pI
        pE' = (fE ? pP' : pA')
        X is unchanged
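      As a model of the changed rules, the Python sketch below applies them
      to integer bitmasks.  It is not kernel code: the function name
      execve_caps and the privileged_file flag (true when the file has file
      caps or is setuid or setgid) are hypothetical.

```python
def execve_caps(pI, pA, X, fP, fI, fE, privileged_file):
    """Apply the evolution rules with the ambient mask.

    Returns (pP', pI', pE', pA').  Masks are integers used as bit sets,
    fE and privileged_file are booleans.
    """
    new_pA = 0 if privileged_file else pA
    new_pP = (X & fP) | (pI & fI) | new_pA
    new_pI = pI
    new_pE = new_pP if fE else new_pA
    return new_pP, new_pI, new_pE, new_pA


if __name__ == "__main__":
    cap = 1 << 10        # illustrative capability bit
    full = (1 << 64) - 1  # stand-in for "the full set"

    # Ordinary binary run by nonroot: the ambient bit reaches pP' and pE'.
    print(execve_caps(pI=cap, pA=cap, X=full, fP=0, fI=0, fE=False,
                      privileged_file=False))

    # File-capped or setuid/setgid program: pA is cleared, so the result
    # is exactly what the rules without pA would have produced.
    print(execve_caps(pI=cap, pA=cap, X=full, fP=0, fI=0, fE=True,
                      privileged_file=True))
```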
      
      If you are nonroot but you have a capability, you can add it to pA.  If
      you do so, your children get that capability in pA, pP, and pE.  For
      example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
      automatically bind low-numbered ports.  Hallelujah!
      
      Unprivileged users can create user namespaces, map themselves to a
      nonzero uid, and create both privileged (relative to their namespace)
      and unprivileged process trees.  This is currently more or less
      impossible.  Hallelujah!
      
      You cannot use pA to try to subvert a setuid, setgid, or file-capped
      program: if you execute any such program, pA gets cleared and the
      resulting evolution rules are unchanged by this patch.
      
      Users with nonzero pA are unlikely to unintentionally leak that
      capability.  If they run programs that try to drop privileges, dropping
      privileges will still work.
      
      It's worth noting that the degree of paranoia in this patch could
      possibly be reduced without causing serious problems.  Specifically, if
      we allowed pA to persist across executing non-pA-aware setuid binaries
      and across setresuid, then, naively, the only capabilities that could
      leak as a result would be the capabilities in pA, and any attacker
      *already* has those capabilities.  This would make me nervous, though --
      setuid binaries that tried to privilege-separate might fail to do so,
      and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
      unexpected side effects.  (Whether these unexpected side effects would
      be exploitable is an open question.) I've therefore taken the more
      paranoid route.  We can revisit this later.
      
      An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
      ambient capabilities.  I think that this would be annoying and would
      make granting otherwise unprivileged users minor ambient capabilities
      (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
      it is with this patch.
      
      ===== Footnotes =====
      
      [1] Files that are missing the "security.capability" xattr or that have
      unrecognized values for that xattr end up with has_cap set to false.
      The code that does that appears to be complicated for no good reason.
      
      [2] The libcap capability mask parsers and formatters are dangerously
      misleading and the documentation is flat-out wrong.  fE is *not* a mask;
      it's a single bit.  This has probably confused every single person who
      has tried to use file capabilities.
      
      [3] Linux very confusingly processes both the script and the interpreter
      if applicable, for reasons that elude me.  The results from thinking
      about a script's file capabilities and/or setuid bits are mostly
      discarded.
      
      Preliminary userspace code is here, but it needs updating:
      https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2
      
      Here is a test program that can be used to verify the functionality
      (from Christoph):
      
      /*
       * Test program for the ambient capabilities. This program spawns a shell
       * that allows running processes with a defined set of capabilities.
       *
       * (C) 2015 Christoph Lameter <cl@linux.com>
       * Released under: GPL v3 or later.
       *
       *
       * Compile using:
       *
       *	gcc -o ambient_test ambient_test.c -lcap-ng
       *
       * This program must have the following capabilities to run properly:
       * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
       *
       * A command to equip the binary with the right caps is:
       *
       *	setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
       *
       *
       * To get a shell with additional caps that can be inherited by other processes:
       *
       *	./ambient_test /bin/bash
       *
       *
       * Verifying that it works:
       *
       * From the bash spawned by ambient_test run
       *
       *	cat /proc/$$/status
       *
       * and have a look at the capabilities.
       */
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <errno.h>
      #include <unistd.h>	/* execv() */
      #include <cap-ng.h>
      #include <sys/prctl.h>
      #include <linux/capability.h>
      
      /*
       * Definitions from the kernel header files. These are going to be removed
       * when the /usr/include files have these defined.
       */
      #define PR_CAP_AMBIENT 47
      #define PR_CAP_AMBIENT_IS_SET 1
      #define PR_CAP_AMBIENT_RAISE 2
      #define PR_CAP_AMBIENT_LOWER 3
      #define PR_CAP_AMBIENT_CLEAR_ALL 4
      
      static void set_ambient_cap(int cap)
      {
      	int rc;
      
      	capng_get_caps_process();
      	rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
      	if (rc) {
      		printf("Cannot add inheritable cap\n");
      		exit(2);
      	}
      	capng_apply(CAPNG_SELECT_CAPS);
      
      	/* Note the two 0s at the end. Kernel checks for these */
      	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
      		perror("Cannot set cap");
      		exit(1);
      	}
      }
      
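      int main(int argc, char **argv)
      {
      	if (argc < 2) {
      		fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]);
      		return 1;
      	}

      	set_ambient_cap(CAP_NET_RAW);
      	set_ambient_cap(CAP_NET_ADMIN);
      	set_ambient_cap(CAP_SYS_NICE);

      	printf("Ambient_test forking shell\n");
      	fflush(stdout);	/* buffered output would be lost across exec */

      	/* execv() only returns on failure */
      	execv(argv[1], argv + 1);
      	perror("Cannot exec");
      	return 1;
      }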
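      To "have a look at the capabilities" without decoding hex by hand, a
      small Python helper can parse the Cap* lines of a status file.  This is
      a sketch: the helper names are hypothetical, the SAMPLE text is made up
      for illustration, and it assumes the ambient mask is reported in a
      "CapAmb" field next to the existing CapInh/CapPrm/CapEff/CapBnd fields.

```python
def parse_cap_masks(status_text: str) -> dict:
    """Extract every 'Cap*: <hex>' field from /proc/<pid>/status text."""
    masks = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Cap"):
            masks[key] = int(value.strip(), 16)
    return masks


def cap_numbers(mask: int) -> list:
    """Return the capability numbers (bit positions) set in a mask."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


# Made-up sample: bits 12, 13 and 23 set, matching the capability numbers
# of CAP_NET_ADMIN, CAP_NET_RAW and CAP_SYS_NICE.
SAMPLE = (
    "Name:\tbash\n"
    "CapInh:\t0000000000803000\n"
    "CapPrm:\t0000000000803000\n"
    "CapEff:\t0000000000803000\n"
    "CapAmb:\t0000000000803000\n"
)

if __name__ == "__main__":
    for name, mask in parse_cap_masks(SAMPLE).items():
        print(name, cap_numbers(mask))
```

      On a live system the same helper can be fed the contents of
      /proc/self/status instead of SAMPLE.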
      
      Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Aaron Jones <aaronmdjones@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: Markku Savela <msa@moth.iki.fi>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58319057