1. 21 Aug, 2015 5 commits
    • vfs: Test for and handle paths that are unreachable from their mnt_root · 397d425d
      Eric W. Biederman authored
      In rare cases a directory can be renamed out from under a bind mount.
      In those cases without special handling it becomes possible to walk up
      the directory tree to the root dentry of the filesystem and down
      from the root dentry to every other file or directory on the filesystem.
      
      Like division by zero, ".." from an unconnected path cannot be given
      a useful semantic, as there is no predicting at which path component
      the code will realize it is unconnected.  We certainly cannot keep
      the current behavior, as the current behavior is a security hole.
      
      Therefore, when encountering ".." while following an unconnected path,
      return -ENOENT.
      
      - Add a function path_connected() to verify path->dentry is reachable
        from path->mnt->mnt_root, i.e. to validate that a rename did not do
        something nasty to the bind mount.

        To avoid races, path_connected() must be called after following a path
        component to its next path component.
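
      A minimal sketch of the kind of check described above (not the exact
      kernel code; only the helper name and the fields mentioned in this
      description are assumed):

        /* Walk up d_parent from path->dentry; if we hit a filesystem root
         * without passing through path->mnt->mnt_root, the path escaped
         * its bind mount. */
        static bool path_connected(const struct path *path)
        {
                struct vfsmount *mnt = path->mnt;
                struct dentry *dentry = path->dentry;

                while (dentry != mnt->mnt_root) {
                        if (IS_ROOT(dentry))
                                return false;   /* unreachable from mnt_root */
                        dentry = dentry->d_parent;
                }
                return true;
        }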
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: Reduce the scope of i_lock in d_splice_alias · a03e283b
      Eric W. Biederman authored
      i_lock is only needed until __d_find_any_alias calls dget on the alias
      dentry.  After that the reference held by new ensures that dentry_kill
      and d_delete will not remove the inode from the dentry or remove the
      dentry from the inode->i_dentry list.
      
      The inode i_lock came to be held over the __d_move calls in
      d_splice_alias through a series of introductions of locks with
      increasingly smaller scope.  First it was the dcache_lock, then
      it was the dcache_inode_lock, and finally inode->i_lock.
      
      Furthermore, inode->i_lock is not held over any other calls
      to d_move or __d_move, so it cannot provide any meaningful
      rename protection.
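
      A sketch of the resulting lock scope (simplified; not the verbatim
      kernel code):

        spin_lock(&inode->i_lock);
        new = __d_find_any_alias(inode);   /* dget()s the alias under i_lock */
        spin_unlock(&inode->i_lock);       /* the reference now pins new */

        /* __d_move() and the error paths proceed without inode->i_lock */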
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: Handle escaped paths in prepend_path · cde93be4
      Eric W. Biederman authored
      A rename can result in a dentry that by walking up d_parent
      will never reach its mnt_root.  For lack of a better term
      I call this an escaped path.
      
      prepend_path is called by four different functions __d_path,
      d_absolute_path, d_path, and getcwd.
      
      __d_path only wants to see paths that are connected to the root it
      passes in.  So __d_path needs prepend_path to return an error.
      
      d_absolute_path similarly wants to see paths that are connected to
      some root.  Escaped paths are not connected to any mnt_root so
      d_absolute_path needs prepend_path to return an error greater
      than 1.  So escaped paths will be treated like paths on lazily
      unmounted mounts.
      
      getcwd needs to prepend "(unreachable)" so getcwd also needs
      prepend_path to return an error.
      
      d_path is the interesting holdout.  d_path just wants to print
      something, and does not care about the weird cases.  Which raises
      the question: what should be printed?
      
      Given that <escaped_path>/<anything> should result in -ENOENT, I
      believe it is desirable for escaped paths to be printed as empty
      paths, as there are not really any meaningful path components when
      considered from the perspective of a mount tree.
      
      So tweak prepend_path to return an empty path with a new error
      code of 3 when it encounters an escaped path.
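
      A hedged sketch of how a caller such as d_absolute_path() handles the
      return convention described above (simplified; the prepend_path()
      internals are omitted):

        error = prepend_path(path, &root, &res, &buflen);
        if (error > 1)          /* lazily unmounted mount or escaped path */
                error = -EINVAL;
        if (error < 0)
                return ERR_PTR(error);
        return res;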
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • mm: fix potential data race in SyS_swapon · 6f179af8
      Hugh Dickins authored
      While running KernelThreadSanitizer (ktsan) on an upstream kernel with
      trinity, we got a few reports from SyS_swapon; here is one of them:
      
      Read of size 8 by thread T307 (K7621):
       [<     inlined    >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395
       [<ffffffff812242c0>] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345
       [<ffffffff81e97c8a>] ia32_do_call+0x1b/0x25
      
      Looks like the swap_lock should be taken when iterating through the
      swap_info array on lines 2392 - 2401: q->swap_file may be reset to
      NULL by another thread before it is dereferenced for f_mapping.
      
      But why is that iteration needed at all?  Doesn't the claim_swapfile()
      which follows do all that is needed to check for a duplicate entry -
      FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile?
      
      Well, not quite: bd_may_claim() allows the same "holder" to claim the
      bdev again, so we do need to use a different holder than "sys_swapon";
      and we should not replace appropriate -EBUSY by inappropriate -EINVAL.
      
      Index i was reused in a cpu loop further down; it is renamed to cpu there.
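
      A hedged sketch of the holder change described above (the surrounding
      claim_swapfile() code is abridged and details may differ from the
      actual patch):

        /* Use the swap_info_struct itself as the exclusive holder instead
         * of a token shared by every swapon, so a second swapon of the same
         * device fails with -EBUSY rather than being silently re-claimed. */
        p->bdev = bdgrab(I_BDEV(inode));
        error = blkdev_get(p->bdev,
                           FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);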
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Merge branch 'superblock-scaling' of... · 061f98e9
      Al Viro authored
      Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next
      
      Conflicts:
      	include/linux/fs.h
  2. 19 Aug, 2015 2 commits
  3. 18 Aug, 2015 2 commits
    • inode: don't softlockup when evicting inodes · ac05fbb4
      Josef Bacik authored
      On a box with a lot of RAM (148GB) I can make the box soft lockup after
      running an fs_mark job that creates hundreds of millions of empty files.
      This is because we never generate enough memory pressure to keep the number
      of inodes on our unused list low, so when we go to unmount we have to evict
      ~100 million inodes.  This makes one processor a very unhappy person, so add
      a cond_resched() in dispose_list(), and if we need a resched while processing
      the s_inodes list, do that and then run dispose_list() on what we have culled
      so far.  Thanks,
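
      A sketch of the dispose_list() side of the change (simplified, but close
      to the shape described above):

        static void dispose_list(struct list_head *head)
        {
                while (!list_empty(head)) {
                        struct inode *inode;

                        inode = list_first_entry(head, struct inode, i_lru);
                        list_del_init(&inode->i_lru);

                        evict(inode);
                        cond_resched();  /* keep one CPU from spinning for ages */
                }
        }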
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • inode: rename i_wb_list to i_io_list · c7f54084
      Dave Chinner authored
      There's a small consistency problem between the inode and writeback
      naming. Writeback calls the "for IO" inode queues b_io and
      b_more_io, but the inode calls these the "writeback list" or
      i_wb_list. This makes it hard to add a new "under writeback" list to
      the inode, or to call it an "under IO" list on the bdi, because either
      way we'll have writeback on IO and IO on writeback and it'll just be
      confusing. I'm getting confused just writing this!
      
      So, rename the inode "for IO" list variable to i_io_list so we can
      add a new "writeback list" in a subsequent patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
  4. 17 Aug, 2015 4 commits
    • sync: serialise per-superblock sync operations · e97fedb9
      Dave Chinner authored
      When competing sync(2) calls walk the same filesystem, they need to
      walk the list of inodes on the superblock to find all the inodes
      that we need to wait for IO completion on. However, when multiple
      wait_sb_inodes() calls do this at the same time, they contend on
      the inode_sb_list_lock and the contention causes system wide
      slowdowns. In effect, concurrent sync(2) calls can take longer and
      burn more CPU than if they were serialised.
      
      Stop the worst of the contention by adding a per-sb mutex to wrap
      around wait_sb_inodes() so that we only execute one sync(2) IO
      completion walk per superblock at a time and hence avoid
      contention being triggered by concurrent sync(2) calls.
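
      A hedged sketch of the shape of the change (the per-sb mutex is the
      s_sync_lock field added by this patch; the walk itself is abridged):

        static void wait_sb_inodes(struct super_block *sb)
        {
                mutex_lock(&sb->s_sync_lock);   /* one completion walk per sb */
                /* ... walk sb->s_inodes and wait on each inode's dirty pages ... */
                mutex_unlock(&sb->s_sync_lock);
        }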
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
    • inode: convert inode_sb_list_lock to per-sb · 74278da9
      Dave Chinner authored
      The process of reducing contention on per-superblock inode lists
      starts with moving the locking to match the per-superblock inode
      list. This takes the global lock out of the picture and reduces the
      contention problems to within a single filesystem. This doesn't get
      rid of contention as the locks still have global CPU scope, but it
      does isolate operations on different superblocks from each other.
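
      A sketch of what the conversion looks like at a typical call site
      (simplified; the per-sb lock is the s_inode_list_lock introduced by
      this series):

        spin_lock(&inode->i_sb->s_inode_list_lock);
        list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
        spin_unlock(&inode->i_sb->s_inode_list_lock);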
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
    • inode: add hlist_fake to avoid the inode hash lock in evict · cbedaac6
      Josef Bacik authored
      Some filesystems don't use the VFS inode hash and fake the fact they
      are hashed so that all the writeback code works correctly. However,
      this means the evict() path still tries to remove the inode from the
      hash, meaning that the inode_hash_lock needs to be taken
      unnecessarily. Hence under certain workloads the inode_hash_lock can
      be contended even if the inode is never actually hashed.
      
      To avoid this, add hlist_fake to test whether the inode was ever
      actually hashed, so we can skip taking the hash lock on inodes that
      have never been hashed.  Based on Dave Chinner's

      inode: add IOP_NOTHASHED to avoid inode hash lock in evict

      reworked based on Al's suggestions.  Thanks,
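
      A minimal sketch of the idea (hlist_fake() as in include/linux/list.h;
      the exact evict-path check is simplified):

        /* A "fake hashed" node points back at itself, as set up by
         * hlist_add_fake(), so it was never on the real inode hash. */
        static inline bool hlist_fake(struct hlist_node *h)
        {
                return h->pprev == &h->next;
        }

        if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
                remove_inode_hash(inode);   /* only now take inode_hash_lock */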
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
    • writeback: plug writeback at a high level · d353d758
      Dave Chinner authored
      Doing writeback on lots of little files causes terrible IOPS storms
      because of the per-mapping writeback plugging we do. This
      essentially causes immediate dispatch of IO for each mapping,
      regardless of the context in which writeback is occurring.
      
      IOWs, running a concurrent write-lots-of-small-4k-files workload using
      fsmark on XFS results in a huge number of IOPS being issued for data
      writes.  Metadata writes are sorted and plugged at a high level by
      XFS, so they aggregate nicely into large IOs. However, data writeback IOs
      are dispatched in individual 4k IOs, even when the blocks of two
      consecutively written files are adjacent.
      
      Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
      metadata CRCs enabled.
      
      Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
      
      Test:
      
      $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
      /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
      /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
      /mnt/scratch/6  -d  /mnt/scratch/7
      
      Result:
      
      		wall	sys	create rate	Physical write IO
      		time	CPU	(avg files/s)	 IOPS	Bandwidth
      		-----	-----	------------	------	---------
      unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
      patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
      improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
      
      If I use zero length files, this workload runs at about 500 IOPS, so
      plugging drops the data IOs from roughly 25,500/s to 1,000/s.
      3 lines of code, 35% better throughput for 15% less CPU.
      
      The benefits of plugging at this layer are likely to be higher for
      spinning media, as the IO patterns for this workload are going to make
      a much bigger difference on high IO latency devices.
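
      A sketch of the three-line change (placement in writeback_sb_inodes()
      is my reading of the description; the function body is heavily
      abridged):

        static long writeback_sb_inodes(struct super_block *sb,
                                        struct bdi_writeback *wb,
                                        struct wb_writeback_work *work)
        {
                struct blk_plug plug;
                long wrote = 0;

                blk_start_plug(&plug);
                /* ... existing loop writing back each dirty inode on sb ... */
                blk_finish_plug(&plug);
                return wrote;
        }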
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  5. 16 Aug, 2015 8 commits
  6. 15 Aug, 2015 11 commits
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1efdb5f0
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This has two libfc fixes for bugs causing rare crashes, one iscsi fix
        for a potential hang on shutdown, and a fix for an I/O blocksize issue
        which caused a regression"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        sd: Fix maximum I/O size for BLOCK_PC requests
        libfc: Fix fc_fcp_cleanup_each_cmd()
        libfc: Fix fc_exch_recv_req() error path
        libiscsi: Fix host busy blocking during connection teardown
    • change sb_writers to use percpu_rw_semaphore · 8129ed29
      Oleg Nesterov authored
      We can remove everything from struct sb_writers except frozen
      and add the array of percpu_rw_semaphore's instead.
      
      This patch doesn't remove sb_writers->wait_unfrozen yet; we keep
      it for get_super_thawed(). We will probably remove it later.
      
      This change tries to address the following problems:
      
      	- Firstly, __sb_start_write() looks simply buggy. It does
      	  __sb_end_write() if it sees ->frozen, but if it migrates
      	  to another CPU before percpu_counter_dec(), sb_wait_write()
      	  can wrongly succeed if there is another task which holds
      	  the same "semaphore": sb_wait_write() can miss the result
      	  of the previous percpu_counter_inc() but see the result
      	  of this percpu_counter_dec().
      
      	- As Dave Hansen reports, it is suboptimal. The trivial
      	  microbenchmark that writes to a tmpfs file in a loop runs
      	  12% faster if we change this code to rely on RCU and kill
      	  the memory barriers.
      
      	- This code doesn't look simple. It would be better to rely
      	  on the generic locking code.
      
      	  According to Dave, this change adds the same performance
      	  improvement.
      
      Note: with this change both freeze_super() and thaw_super() will do
      synchronize_sched_expedited() 3 times. This is just ugly. But:
      
      	- This will be "fixed" by the rcu_sync changes we are going
      	  to merge. After that freeze_super()->percpu_down_write()
      	  will use synchronize_sched(), and thaw_super() won't use
      	  synchronize() at all.
      
      	  This doesn't need any changes in fs/super.c.
      
      	- Once we merge the rcu_sync changes, we can also change super.c
      	  so that all sb_writers->rw_sem's will share a single ->rss
      	  in struct sb_writers, then freeze_super() will need only one
      	  synchronize_sched().
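
      A sketch of the resulting structure (field names as described above;
      comments are mine):

        struct sb_writers {
                int                        frozen;         /* SB_UNFROZEN, SB_FREEZE_* */
                wait_queue_head_t          wait_unfrozen;  /* kept for get_super_thawed() */
                struct percpu_rw_semaphore rw_sem[SB_FREEZE_LEVELS];
        };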
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • shift percpu_counter_destroy() into destroy_super_work() · 853b39a7
      Oleg Nesterov authored
      Of course, this patch is ugly as hell. It will be (partially)
      reverted later. We add it to ensure that other WIP changes in
      percpu_rw_semaphore won't break fs/super.c.
      
      We do not even need this change right now; percpu_free_rwsem()
      is fine in atomic context. But we are going to change this: it
      will be might_sleep() after we merge the rcu_sync() patches.
      
      And even after that we do not really need destroy_super_work(),
      we will kill it in any case. Instead, destroy_super_rcu() should
      just check that rss->cb_state == CB_IDLE and do call_rcu() again
      in the (very unlikely) case this is not true.
      
      So this is just the temporary kludge which helps us to avoid the
      conflicts with the changes which will be (hopefully) routed via
      rcu tree.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • percpu-rwsem: kill CONFIG_PERCPU_RWSEM · bf3eac84
      Oleg Nesterov authored
      Remove CONFIG_PERCPU_RWSEM; the next patch adds an unconditional
      user of percpu_rw_semaphore.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() · 55cc1565
      Oleg Nesterov authored
      Add percpu_rwsem_release() and percpu_rwsem_acquire() for the users
      which need to return to userspace with percpu-rwsem lock held and/or
      pass the ownership to another thread.
      
      TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
      either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
      the non-inline percpu_rwsem_clear_owner().
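
      A hedged usage sketch (the lock name is hypothetical and the argument
      order is my assumption from the helper names, not taken from the patch):

        static struct percpu_rw_semaphore frozen_sem;  /* hypothetical lock */

        percpu_down_write(&frozen_sem);
        /* return to userspace with the lock still write-held: */
        percpu_rwsem_release(&frozen_sem, false, _THIS_IP_);

        /* later, the task that will eventually unlock re-asserts ownership: */
        percpu_rwsem_acquire(&frozen_sem, false, _THIS_IP_);
        percpu_up_write(&frozen_sem);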
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • percpu-rwsem: introduce percpu_down_read_trylock() · 9287f692
      Oleg Nesterov authored
      Add percpu_down_read_trylock(); it will gain a user soon.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    • document rwsem_release() in sb_wait_write() · 0e28e01f
      Oleg Nesterov authored
      Not only do we need to avoid the warning from lockdep_sys_exit(); the
      caller of freeze_super() may never be the thread that releases this
      lock. Another thread can do it, so there is another reason for
      rwsem_release().

      Plus the comment should explain why we have to fool lockdep.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • fix the broken lockdep logic in __sb_start_write() · f4b554af
      Oleg Nesterov authored
      1. wait_event(frozen < level) without rwsem_acquire_read() is just
         wrong from lockdep perspective. If we are going to deadlock
         because the caller is buggy, lockdep can't detect this problem.
      
      2. __sb_start_write() can race with thaw_super() + freeze_super(),
         and after "goto retry" the 2nd acquire_freeze_lock() is wrong.
      
      3. The "tell lockdep we are doing trylock" hack doesn't look nice.
      
         I think this is correct, but this logic should be more explicit.
         Yes, the recursive read_lock() is fine if we hold the lock on a
         higher level. But we do not need to fool lockdep. If we can not
         deadlock in this case then try-lock must not fail and we can use
         wait == false throughout this code.
      
      Note: as Dave Chinner explains, the "trylock" hack and the fat comment
      can be probably removed. But this needs a separate change and it will
      be trivial: just kill __sb_start_write() and rename do_sb_start_write()
      back to __sb_start_write().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • introduce __sb_writers_{acquired,release}() helpers · bee9182d
      Oleg Nesterov authored
      Preparation to hide the sb->s_writers internals from xfs and btrfs.
      Add two trivial defines they can use rather than playing with ->s_writers
      directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
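
      A hedged sketch of the intended use (the freeze level and the handoff
      pattern are my illustration, not taken from the patch):

        /* Thread A starts a filesystem-internal write and hands the freeze
         * protection to a worker that will finish the IO later: */
        sb_start_intwrite(sb);
        __sb_writers_release(sb, SB_FREEZE_FS);

        /* The worker picks the reference back up before completing: */
        __sb_writers_acquired(sb, SB_FREEZE_FS);
        sb_end_intwrite(sb);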
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 45e38cff
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Just two very small & simple patches"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST
        KVM: x86: zero IDT limit on entry to SMM
    • Merge branch 'akpm' (patches from Andrew) · 8394a1b7
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "11 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        Update maintainers for DRM STI driver
        mm: cma: mark cma_bitmap_maxno() inline in header
        zram: fix pool name truncation
        memory-hotplug: fix wrong edge when hot add a new node
        .mailmap: Andrey Ryabinin has moved
        ipc/sem.c: update/correct memory barriers
        mm/hwpoison: fix panic due to split huge zero page
        ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()
        ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits
        mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount held
        mm/hwpoison: fix page refcount of unknown non LRU page
  7. 14 Aug, 2015 8 commits