1. 01 Oct, 2012 40 commits
    • Btrfs: create a pinned em when writing to a prealloc range in DIO · 69ffb543
      Josef Bacik authored
      Wade Cline reported a problem where he was getting garbage and warnings when
      writing to a preallocated range via O_DIRECT.  This is because we weren't
      creating our normal pinned extent_map for the range we were writing to,
      which was causing all sorts of issues.  This patch fixes the problem and
      makes his testcase much happier.  Thanks,
      Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: move the sb_end_intwrite until after the throttle logic · 6df7881a
      Josef Bacik authored
      Sage reported the following lockdep backtrace
      
      =====================================
      [ BUG: bad unlock balance detected! ]
      3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
      -------------------------------------
      btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
      [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
      but there are no more locks to release!
      
      other info that might help us debug this:
      1 lock held by btrfs-cleaner/7607:
       #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
      
      stack backtrace:
      Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
      Call Trace:
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
       [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
       [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
       [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
       [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff810b2a95>] lock_release+0xd5/0x220
       [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
       [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
       [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
       [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
       [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
       [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
       [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
       [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
       [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
       [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
       [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
       [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
       [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
       [<ffffffff810791ee>] kthread+0xae/0xc0
       [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
       [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
       [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
       [<ffffffff8163e740>] ? gs_change+0x13/0x13
      
      This is because the throttle logic can commit the transaction, which expects to
      be the one ending the intwrite, but we have already done that in
      __btrfs_end_transaction.  Moving the sb_end_intwrite after this logic makes the
      lockdep warning go away.  Thanks,
      Tested-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use larger limit for translation of logical to inode · 425d17a2
      Liu Bo authored
      This is the kernel side of the change.

      Translation of logical to inode used to have an upper limit of 4k on the
      inode container's size, but that limit is not large enough for data
      with a great many refs, so when resolving a logical address
      we can end up with
      "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"

      This change raises the upper limit to 64k and uses vmalloc instead of
      kmalloc so that the memory is easier to get.
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: use helper for logical resolve · df031f07
      Liu Bo authored
      We already have a helper, iterate_inodes_from_logical(), for logical resolve,
      so just use it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix a bug in parsing return value in logical resolve · 69917e43
      Liu Bo authored
      In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
      
      It is possible to lose our errors because
      (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
      
      I'm not sure if it is on purpose, it just looks too hacky if it is.
      I'd rather use a real flag and a 'ret' to catch errors.
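A minimal sketch of why a negative errno can pass a flag test, and of the "real flag plus ret" shape suggested above. The macro value and both function names are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for BTRFS_EXTENT_FLAG_TREE_BLOCK (value assumed). */
#define EXTENT_FLAG_TREE_BLOCK 0x2

/* Problematic pattern: the return code is treated as a flag word, so a
 * negative errno whose bit pattern contains the flag bit looks like a match. */
bool looks_like_tree_block(int ret)
{
	return (ret & EXTENT_FLAG_TREE_BLOCK) != 0;
}

/* Safer pattern: errors travel in the return value, flags in their own word. */
int classify(int ret, unsigned long flags, bool *is_tree_block)
{
	assert(is_tree_block != NULL);
	if (ret < 0)
		return ret;	/* propagate the error untouched */
	*is_tree_block = (flags & EXTENT_FLAG_TREE_BLOCK) != 0;
	return 0;
}
```

With the first function, an error such as -ENOENT is reported as a tree block; with the second, the caller sees the error before any flag is inspected.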
      Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Liu Bo <liub.liubo@gmail.com>
    • Btrfs: update delayed ref's tracepoints to show sequence · dea7d76e
      Liu Bo authored
      We've added a new field 'sequence' to delayed ref node, so update related
      tracepoints.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: cleanup for unused ref cache stuff · 0647d6bd
      liubo authored
      Since the ref cache has been removed from btrfs, its lock and its check
      no longer have any users.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix corrupted metadata in the snapshot · 8407aa46
      Miao Xie authored
      When we delete an inode, we remove all the delayed items, including the delayed
      inode update, and then truncate all the related metadata. If there is a lot of
      metadata, we end the current transaction and start a new transaction to
      truncate the remaining metadata. This leaves an inode item whose
      link count is > 0, and may also leave some directory index items in the fs/file tree
      after the current transaction ends. In other words, the metadata in this fs/file tree
      is inconsistent. If we create a snapshot of this tree now, we will find an inode with
      corrupted metadata in the new snapshot, and we won't continue to drop the remaining metadata,
      because its link count is not 0.
      
      We fix this problem by updating the inode item before the current transaction ends.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • btrfs: polish names of kmem caches · 837e1972
      David Sterba authored
      Use case:

        watch 'grep btrfs < /proc/slabinfo'

      makes it easy to watch all caches in one go.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • Btrfs: fix our overcommit math · a80c8dcf
      Josef Bacik authored
      I noticed I was seeing large lags when running my torrent test in a vm on my
      laptop.  While trying to make it lag less I noticed that our overcommit math
      was taking into account the number of bytes we wanted to reclaim, not the
      number of bytes we actually wanted to allocate, which means we wouldn't
      overcommit as often.  This patch fixes the overcommit math and makes
      shrink_delalloc() use that logic so that it will stop looping faster.  We
      still have pretty high spikes of latency, but the test now takes 3 minutes
      less time (about 5% faster).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: wait on async pages when shrinking delalloc · dea31f52
      Josef Bacik authored
      Mitch reported a problem where you could get an ENOSPC error when untarring
      a kernel git tree onto a 16gb file system with compress-force=zlib.  This is
      because compression is a huge pain, it will return from ->writepages()
      without having actually created any ordered extents.  To get around this we
      check to see if the async submit counter is up, and if it is wait until it
      drops to 0 before doing our normal ordered wait dance.  With this patch I
      can now untar a kernel git tree onto a 16gb file system without getting
      ENOSPC errors.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo authored
      We're going to use the EXTENT_DEFRAG flag to indicate which ranges
      belong to defragmentation so that we can implement snapshot-aware defrag:

      We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
      defragmented, so that the writeback thread can later differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: check return value of ulist_alloc() properly · 3d6b5c3b
      Tsutomu Itoh authored
      ulist_alloc() can return NULL,
      so the return value must be checked.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • Btrfs: fix error handling in delete_block_group_cache() · f54fb859
      Tsutomu Itoh authored
      btrfs_iget() never returns NULL,
      so the NULL check is unnecessary.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • Btrfs: fix wrong size for the reservation when doing file pre-allocation. · 903889f4
      Miao Xie authored
      When we ran fsstress (a program in xfstests), the filesystem hung when it
      was full. This was because the space reserved in btrfs_fallocate() was wrong:
      btrfs_fallocate() used just the size of the pre-allocation to reserve the
      space and did not take block size alignment into account, so the size of
      the reserved space was less than the allocated space. This caused the over
      reserve problem and made the filesystem hang when invoking cow_file_range().
      Fix it.
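The alignment issue described above can be sketched in userspace as follows. This is only an illustration under assumed names (align_down, align_up, naive_reserve, aligned_reserve are hypothetical helpers, not the functions used in btrfs_fallocate()), and it assumes the block size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a power-of-two alignment a. */
uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return x & ~(a - 1);
}

/* Round x up to a power-of-two alignment a. */
uint64_t align_up(uint64_t x, uint64_t a)
{
	return align_down(x + a - 1, a);
}

/* Reservation that ignores alignment: just the requested length. */
uint64_t naive_reserve(uint64_t offset, uint64_t len)
{
	(void)offset;
	return len;
}

/* Reservation that covers every block touched by [offset, offset + len). */
uint64_t aligned_reserve(uint64_t offset, uint64_t len, uint64_t blocksize)
{
	uint64_t start = align_down(offset, blocksize);
	uint64_t end = align_up(offset + len, blocksize);

	return end - start;
}
```

For an unaligned request such as offset 1000 and length 5000 with 4096-byte blocks, the aligned reservation spans two whole blocks, which is more than the requested length.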
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: output more information when aborting a unused transaction handle · 69ce977a
      Miao Xie authored
      Though we dump the stack information when aborting an unused transaction
      handle, we don't know exactly where we decided to abort the
      transaction handle if one function has several places where the transaction
      abort function is invoked and all of them jump to the same place after the call.
      Besides that, we also don't know the reason why we jumped to abort
      the current handle. So I modify the transaction abort function and make
      it output the function name, line and error information.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix unprotected ->log_batch · 2ecb7923
      Miao Xie authored
      We forgot to protect ->log_batch when syncing a file; this patch fixes
      the problem by using atomic operations. And since ->log_batch is only used to check
      whether there are parallel sync operations, it is unnecessary to
      reset it to 0 after the sync operation of the current log tree completes.
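A userspace sketch of the idea, using C11 atomics rather than the kernel's atomic_t API. The struct and function names are hypothetical. It shows why no reset is needed: a syncer only compares the counter against the value it saw earlier, so the absolute value does not matter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of the log root that holds the counter. */
struct log_root {
	atomic_int log_batch;
};

/* Every sync bumps the counter atomically and returns the new value. */
int note_sync(struct log_root *r)
{
	assert(r != NULL);
	return atomic_fetch_add(&r->log_batch, 1) + 1;
}

/* A waiting syncer detects parallel syncs by checking whether the counter
 * moved since it last looked; only the difference is meaningful. */
int batch_changed(struct log_root *r, int seen)
{
	assert(r != NULL);
	return atomic_load(&r->log_batch) != seen;
}
```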
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix wrong size for the reservation of the snapshot creation · 48c03c4b
      Miao Xie authored
      We should insert/update 6 items (root ref, root backref, dir item, dir index,
      root item and parent inode) when creating a snapshot, not 5 items. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix the snapshot that should not exist · 42874b3d
      Miao Xie authored
      The snapshot should be the image of the fs tree before it was created,
      so the metadata of the snapshot should not exist in its own tree. But now we
      find that the directory item and directory name index are in both the snapshot tree
      and the fs tree. This introduces some problems and confuses users:
      
       # mkfs.btrfs /dev/sda1
       # mount /dev/sda1 /mnt
       # mkdir /mnt/1
       # cd /mnt/1
       # btrfs subvolume snapshot /mnt snap0
       # ls -a /mnt/1/snap0/1
       .	..	[no other file/dir]
      
       # ll /mnt/1/snap0/
       total 0
       drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
      			^^^
      			There is no file/dir in it, but its size is 10
      
       # cd /mnt/1/snap0/1/snap0
       [Entered a nonexistent directory successfully...]
      
      There is nothing in directory 1 in snap0, but btrfs reports that the length of
      this directory is 10. Besides that, we can enter a nonexistent directory, which is
      very strange to users.
      
       # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
       # ll /mnt/1/snap0/1/
       total 0
       [None]
       # ll /mnt/snap1/1/
       total 0
       drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
      
      The source of snap1 did not have any directory in directory 1, but snap1 has
      a snap0, so the source and the snapshot differ.
      
      So I think we should insert the directory item and directory name index and update
      the parent inode as the last step of snapshot creation, and not leave
      useless metadata in the file tree.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie authored
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Currently we identify the type just by comparing the address of the reservation
      variants, which is very ugly for a temporary one because we need to compare it
      with all the common reservation variants. So we add a new "type" field to record
      the type of the reservation variants.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: use a slab for ordered extents allocation · 6352b91d
      Miao Xie authored
      The ordered extent allocation is in the fast path of the IO, so use a slab
      to improve the speed of the allocation.
      
       "Size of the struct is 280, so this will fall into the size-512 bucket,
        giving 8 objects per page, while own slab will pack 14 objects into a page.
      
        Another benefit I see is to check for leaked objects when the module is
        removed (and the cache destroy takes place)."
      						-- David Sterba
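The packing numbers in the quote can be reproduced with a little arithmetic. The sketch below assumes 4096-byte pages and a simplified model in which generic allocations are rounded up to the next power-of-two bucket and slab metadata is ignored; the helper names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u	/* assumption: 4 KiB pages */

/* Smallest power-of-two bucket that fits an object of the given size
 * (simplified model of a generic size-N cache). */
size_t bucket_size(size_t size)
{
	size_t b = 8;

	while (b < size)
		b <<= 1;
	return b;
}

/* Number of objects of obj_size bytes that fit in one page. */
size_t objs_per_page(size_t obj_size)
{
	assert(obj_size > 0);
	return PAGE_SIZE_BYTES / obj_size;
}
```

Under these assumptions a 280-byte object lands in a 512-byte bucket (8 per page), while a dedicated cache with 280-byte objects fits 14 per page.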
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix file extent discount problem in the snapshot · b9a8cc5b
      Miao Xie authored
      If a snapshot is created while we are writing some data into the file,
      the i_size of the corresponding file in the snapshot will be wrong, it will
      be beyond the end of the last file extent. And btrfsck will report:
        root 256 inode 257 errors 100
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
       # for ((i=0; i<4; i++))
       > do
       > btrfs sub snap . $i
       > done
      
      This is because the algorithm of the disk_i_size update is wrong. Though there are
      some ordered extents behind the current one which we use to update disk_i_size,
      it doesn't mean those extents will be dealt with in the same transaction. So
      we shouldn't use the offset of those extents to update disk_i_size, or we will
      get the wrong i_size in the snapshot.

      We fix this problem by recording the max real i_size. If we find there is an
      ordered extent which is in front of the current one and hasn't completed, we
      record the end of the current one into that ordered extent. And if
      the current extent holds the end of another extent (it must be greater than
      the current one because it is behind the current one), we record the
      number that the current extent holds. In this way, we can exclude the ordered
      extents that may not be dealt with in the same transaction, and it is easy to
      know the real disk_i_size.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix full backref problem when inserting shared block reference · 361048f5
      Miao Xie authored
      If we create several snapshots at the same time, the following BUG_ON() will be
      triggered.
      
      	kernel BUG at fs/btrfs/extent-tree.c:6047!
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
       # for ((i=0; i<4; i++))
       > do
       > mkdir $i
       > for ((j=0; j<200; j++))
       > do
       > btrfs sub snap . $i/$j
       > done &
       > done
      
      The reason is:
      Before transaction commit, some operations changed the fs tree and new tree
      blocks were allocated because of COW. We used the implicit non-shared back
      reference for those newly allocated tree blocks because they were not shared by
      two or more trees.
      
      And then we created the first snapshot for the fs tree, according to the back
      reference rules, we also used implicit back refs for the child tree blocks of
      the root node of the fs tree, now those child nodes/leaves were shared by two
      trees.
      
      Then we didn't deal with the delayed references, and continued to change the fs
      tree (created the second snapshot and inserted the dir item of the new snapshot
      into the fs tree). According to the rules of the back reference, we added full
      back refs for those tree blocks whose parents had been shared by two trees.
      Now some newly allocated tree blocks had two types of references.

      As we know, the delayed reference system handles these delayed references from
      back to front, and the full delayed reference is inserted after the implicit
      ones. So when we dealt with the back references of those newly allocated tree
      blocks, the full references were dealt with first. And if the first reference
      is a shared back reference and the tree block that the reference points to is
      newly allocated, it would be considered a tree block which was shared by two
      or more trees when it was allocated and which should have a full back reference, not an
      implicit one; the flag of its reference should also be set to FULL_BACKREF.
      But in fact, it was a non-shared tree block with an implicit reference at the
      beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
      was triggered.
      
      We have several methods to fix this bug:
      1. Deal with delayed references after the snapshot is created and before we
         change the source tree of the snapshot. This is the easiest and safest way.
      2. Modify the sort method of the delayed reference tree so that the full delayed
         references are inserted before the implicit ones. It is also very easy, but
         I don't know if it will introduce some problems or not.
      3. Modify select_delayed_ref() and make it select the implicit delayed reference
         first. This way is not so good because it may waste CPU time if we have
         lots of delayed references.
      4. Set the flags to FULL_BACKREF. This method is a little complex compared with
         the 1st way.
      
      I chose the 1st way to fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix error path in create_pending_snapshot() · 6fa9700e
      Miao Xie authored
      This patch fixes the following problems:
      - If we fail to deal with the delayed dir items, we should abort the transaction,
        just as the comment says. Fix it.
      - If root reference or root back reference insertion fails, we should
        abort the transaction. Fix it.
      - Fix the double free of pending->inherit.
      - Do not restore trans->rsv if we didn't change it.
      - Make the error path clearer.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: fix possible memory leak in scrub_setup_recheck_block() · cf93dcce
      Wei Yongjun authored
      bbio is allocated in btrfs_map_block() and should be
      freed before returning from the error handling cases.

      spatch with a semantic match was used to find this problem.
      (http://coccinelle.lip6.fr/)
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
    • Btrfs: btrfs_drop_extent_cache should never fail · 7014cdb4
      Josef Bacik authored
      I noticed this when I was doing the fsync stuff, we allocate split extents if we
      drop an extent range that is in the middle of an existing extent.  This BUG()'s
      if we fail to allocate memory, but the fact is this is just a cache, we will
      just regenerate the cache if we need it, the important part is that we free the
      range we are given.  This can be done without allocations, so if we fail to
      allocate splits just skip the splitting stage and free our em and look for more
      extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
      was checking the return value anyway.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() · ac14aed6
      Sage Weil authored
      Josef has suggested that this is not necessary.  Removing it also avoids
      this lockdep splat (after the new sb_internal locking stuff was added):
      
      [  604.090449] ======================================================
      [  604.114819] [ INFO: possible circular locking dependency detected ]
      [  604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted
      [  604.162193] -------------------------------------------------------
      [  604.186139] btrfs-cleaner/6669 is trying to acquire lock:
      [  604.209555]  (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  604.257100]
      [  604.257100] but task is already holding lock:
      [  604.300366]  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
      [  604.352989]
      [  604.352989] which lock already depends on the new lock.
      [  604.352989]
      [  604.427104]
      [  604.427104] the existing dependency chain (in reverse order) is:
      [  604.478493]
      [  604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}:
      [  604.529313]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  604.559621]        [<ffffffff81632b69>] down_read+0x39/0x4e
      [  604.589382]        [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs]
      [  604.596161] btrfs: unlinked 1 orphans
      [  604.675002]        [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs]
      [  604.708859]        [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs]
      [  604.772466]        [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs]
      [  604.842245]        [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
      [  604.912852]        [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs]
      [  604.951888]        [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560
      [  604.989961]        [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0
      [  605.026628]        [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b
      [  605.064404]
      [  605.064404] -> #0 (sb_internal#2){.+.+..}:
      [  605.126832]        [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
      [  605.163671]        [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  605.200228]        [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
      [  605.236818]        [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  605.274029]        [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
      [  605.340520]        [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
      [  605.378720]        [<ffffffff811972c8>] evict+0xb8/0x1c0
      [  605.416057]        [<ffffffff811974d5>] iput+0x105/0x210
      [  605.452373]        [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
      [  605.521627]        [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
      [  605.560520]        [<ffffffff810791ee>] kthread+0xae/0xc0
      [  605.598094]        [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
      [  605.636499]
      [  605.636499] other info that might help us debug this:
      [  605.636499]
      [  605.736504]  Possible unsafe locking scenario:
      [  605.736504]
      [  605.801931]        CPU0                    CPU1
      [  605.835126]        ----                    ----
      [  605.867093]   lock(&fs_info->cleanup_work_sem);
      [  605.898594]                                lock(sb_internal#2);
      [  605.931954]                                lock(&fs_info->cleanup_work_sem);
      [  605.965359]   lock(sb_internal#2);
      [  605.994758]
      [  605.994758]  *** DEADLOCK ***
      [  605.994758]
      [  606.075281] 2 locks held by btrfs-cleaner/6669:
      [  606.104528]  #0:  (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs]
      [  606.165626]  #1:  (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs]
      [  606.231297]
      [  606.231297] stack backtrace:
      [  606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1
      [  606.347823] Call Trace:
      [  606.376184]  [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c
      [  606.409243]  [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90
      [  606.441343]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.474583]  [<ffffffff810b2c82>] lock_acquire+0xa2/0x140
      [  606.505934]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.539429]  [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
      [  606.571719]  [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0
      [  606.603498]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.637405]  [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs]
      [  606.670165]  [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160
      [  606.702144]  [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs]
      [  606.735562]  [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs]
      [  606.769861]  [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs]
      [  606.804575]  [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs]
      [  606.838756]  [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
      [  606.872010]  [<ffffffff811972c8>] evict+0xb8/0x1c0
      [  606.903800]  [<ffffffff811974d5>] iput+0x105/0x210
      [  606.935416]  [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs]
      [  606.970510]  [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs]
      [  607.005648]  [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs]
      [  607.040724]  [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
      [  607.104740]  [<ffffffff810791ee>] kthread+0xae/0xc0
      [  607.137119]  [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
      [  607.169797]  [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
      [  607.202472]  [<ffffffff81635430>] ? retint_restore_args+0x13/0x13
      [  607.235884]  [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
      [  607.268731]  [<ffffffff8163e740>] ? gs_change+0x13/0x13
      Signed-off-by: Sage Weil <sage@inktank.com>
    • Btrfs: set journal_info in async trans commit worker · e209db7a
      Sage Weil authored
      We expect current->journal_info to point to the trans handle we are
      committing.
      Signed-off-by: Sage Weil <sage@inktank.com>
    • Btrfs: pass lockdep rwsem metadata to async commit transaction · 6fc4e354
      Sage Weil authored
      The freeze rwsem is taken by sb_start_intwrite() and dropped during the
      commit_ or end_transaction().  In the async case, that happens in a worker
      thread.  Tell lockdep the calling thread is releasing ownership of the
      rwsem and the async thread is picking it up.
      
      XFS plays the same trick in fs/xfs/xfs_aops.c.
      Signed-off-by: Sage Weil <sage@inktank.com>
    • Btrfs: add hole punching · 2aaa6655
      Josef Bacik authored
      This patch adds hole punching via fallocate.  Thanks,
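From userspace, hole punching is requested through the Linux fallocate(2) call with FALLOC_FL_PUNCH_HOLE, which must be combined with FALLOC_FL_KEEP_SIZE. The wrapper below is a small sketch of such a caller; punch_hole is a hypothetical helper name and is not part of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/* Deallocate the byte range [offset, offset + len) of an open file while
 * keeping the file size unchanged. Returns 0 on success, or -1 with errno
 * set (for example EOPNOTSUPP if the filesystem cannot punch holes). */
int punch_hole(int fd, off_t offset, off_t len)
{
	assert(len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

After a successful call, reads from the punched range return zeros and the blocks that backed it are freed.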
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik authored
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway, so let's just remove it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: check if an inode has no checksum when logging it · d2794405
      Liu Bo authored
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
      items any more.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34
      Liu Bo authored
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      Btrfs currently checks whether an inode is in the log by comparing
      root's last_log_commit to inode's last_sub_trans[1].

      But the problem is that this root->last_log_commit is shared among
      inodes.

      Say we have N inodes to be logged; after the first inode,
      root's last_log_commit is updated and the remaining N-1 files will
      be skipped.

      This fixes the bug by keeping a local copy of root's last_log_commit
      inside each inode, and this local copy is maintained on its own.
      
      [1]: we regard each log transaction as a subset of btrfs's transaction,
      i.e. sub_trans
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      46d8bc34
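The shared-counter problem described in this commit can be illustrated with a toy model. This is only a sketch: the `toy_root` and `toy_inode` structures and the `fsync_all` function are hypothetical stand-ins, not the kernel's data structures, and the model assumes all inodes were modified in the same log sub-transaction before any of them is logged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the real structures. */
struct toy_root  { int log_transid; int last_log_commit; };
struct toy_inode { int last_sub_trans; int last_log_commit; };

/* Buggy check: compares against a counter shared by every inode. */
static bool in_log_shared(const struct toy_root *r, const struct toy_inode *i)
{
	return i->last_sub_trans <= r->last_log_commit;
}

/* Fixed check: compares against the inode's own copy of the counter. */
static bool in_log_local(const struct toy_inode *i)
{
	return i->last_sub_trans <= i->last_log_commit;
}

/* Log n freshly modified inodes in turn; return how many really got logged. */
size_t fsync_all(struct toy_inode *inodes, size_t n, bool use_local)
{
	struct toy_root root = { .log_transid = 1, .last_log_commit = 0 };
	size_t logged = 0;

	assert(inodes != NULL || n == 0);

	/* Every inode is dirtied in the current log sub-transaction. */
	for (size_t i = 0; i < n; i++) {
		inodes[i].last_sub_trans = root.log_transid;
		inodes[i].last_log_commit = root.last_log_commit;
	}

	for (size_t i = 0; i < n; i++) {
		bool skip = use_local ? in_log_local(&inodes[i])
				      : in_log_shared(&root, &inodes[i]);
		if (skip)
			continue;
		logged++;
		/* Committing the log advances both the root and this inode. */
		root.last_log_commit = root.log_transid;
		inodes[i].last_log_commit = root.log_transid;
	}
	return logged;
}
```

With four inodes, the shared check logs only the first one and wrongly skips the other three, while the per-inode check logs all four.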
    • Miao Xie's avatar
      Btrfs: fix wrong orphan count of the fs/file tree · 321f0e70
      Miao Xie authored
      If we add a new orphan item, we should increase the atomic counter,
      not decrease it. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      321f0e70
    • Liu Bo's avatar
      Btrfs: improve fsync by filtering extents that we want · 4e2f84e6
      Liu Bo authored
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      Josef's patch above performs very well in the random sync write test,
      because there are not many extents to merge.
      
      However, it does not perform well on this test:
      dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
      
      The reason is that when we do sequential sync writes, each new extent is
      merged with the previous one, so the extent we have to log keeps
      growing:
      
      A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
      
      So we'll have to flush more and more checksums into the log tree, which
      is the bottleneck according to my tests.
      
      But we can avoid this by telling fsync which extents really need
      to be logged.
      
      With this, I ran the above dd sync write test (size=50m):
      
               w/o (orig)   w/ (josef's)   w/ (this)
      SATA      104KB/s       109KB/s       121KB/s
      ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      4e2f84e6
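The A(4k) --> AA(8k) --> AAA(12k) growth in this commit message implies that the checksum work grows quadratically with the number of sequential sync writes. A rough sketch of that arithmetic follows. It assumes one checksum per 4k block and that every fsync re-logs the whole merged extent; both are simplifications, and `csums_logged` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count block checksums pushed into the log over nr_writes sequential
 * 4k sync writes.  With merge != 0 each fsync logs the whole accumulated
 * extent (1, 2, 3, ... blocks); with merge == 0 it logs only the block
 * that was just written.
 */
uint64_t csums_logged(uint64_t nr_writes, int merge)
{
	uint64_t total = 0;
	uint64_t extent_blocks = 0;

	/* Keeps the running sum comfortably within 64 bits. */
	assert(nr_writes <= UINT32_MAX);

	for (uint64_t w = 0; w < nr_writes; w++) {
		extent_blocks = merge ? extent_blocks + 1 : 1;
		total += extent_blocks;
	}
	return total;
}
```

For the dd test above (count=12500), the merged case comes to 12500 * 12501 / 2 = 78131250 checksums, against 12500 when only the newly written block is logged.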
    • Josef Bacik's avatar
      Btrfs: do not needlessly restart the transaction for enospc · ca7e70f5
      Josef Bacik authored
      We will stop and restart a transaction every time we move to a different leaf
      when truncating a file.  This is for enospc reasons, but really we could
      probably get away with doing this a little better by actually working until we
      hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
      truncates which will fail as soon as the block rsv runs out of space, and then
      at that point we can stop and restart the transaction and refill the block rsv
      and carry on.  This will make rm'ing of a file with lots of extents a bit
      faster.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      ca7e70f5
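The idea of working until the reservation runs dry, instead of restarting at every leaf, can be sketched with a toy model. Everything here is hypothetical: `toy_rsv`, `rsv_use` and `truncate_restarts` are illustrative names, each leaf is assumed to cost one unit of reservation, and the reservation always behaves as if failfast were set (it returns -ENOSPC rather than falling back to any other pool).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a block reservation in failfast mode. */
struct toy_rsv { unsigned reserved; };

/* Consume units from the reservation, failing as soon as it runs out. */
static int rsv_use(struct toy_rsv *rsv, unsigned units)
{
	if (rsv->reserved < units)
		return -ENOSPC;
	rsv->reserved -= units;
	return 0;
}

/*
 * Truncate a file spanning nr_leaves leaves.  The transaction is only
 * stopped and restarted when the reservation is exhausted, at which
 * point it is refilled to `refill` units.  Returns the restart count.
 */
unsigned truncate_restarts(unsigned nr_leaves, unsigned refill)
{
	struct toy_rsv rsv = { .reserved = refill };
	unsigned restarts = 0;
	unsigned done = 0;

	assert(refill > 0);	/* a zero refill could never make progress */

	while (done < nr_leaves) {
		if (rsv_use(&rsv, 1) == -ENOSPC) {
			restarts++;		/* stop and restart the transaction */
			rsv.reserved = refill;	/* refill the reservation */
			continue;
		}
		done++;
	}
	return restarts;
}
```

In this model, ten leaves with a refill of four units need two restarts, where restarting on every leaf change would need roughly one per leaf.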
    • Liu Bo's avatar
      Btrfs: cleanup extents after we finish logging inode · 06d3d22b
      Liu Bo authored
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      We should clean up those extents after we've finished logging the inode,
      otherwise we may do redundant work on them.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      06d3d22b
    • Josef Bacik's avatar
      Btrfs: only warn if we hit an error when doing the tree logging · 0fa83cdb
      Josef Bacik authored
      I hit this a couple times while working on my fsync patch (all my bugs, not
      normal operation), but with my new stuff we could have new errors from cases
      I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
      that we are notified there is a problem but the user doesn't lose their
      data.  We can easily commit the transaction in the case that the tree
      logging fails and still be fine, so let's try and be as nice to the user as
      possible.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      0fa83cdb
    • Josef Bacik's avatar
      Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik authored
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      5dc562c5
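The core of the approach described in this commit message (track which transid modified each extent, pick out the ones touched in the current transaction, sort them, and log only those) can be sketched as follows. This is an illustration under assumptions, not the patch's code: `toy_extent` and `collect_modified` are hypothetical names, and a flat array stands in for the in-memory extent tracking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical extent record carrying the transid that last modified it. */
struct toy_extent {
	uint64_t start;
	uint64_t len;
	uint64_t generation;
};

/* Order extents by file offset. */
static int cmp_start(const void *a, const void *b)
{
	const struct toy_extent *x = a;
	const struct toy_extent *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy the extents modified in transaction cur_transid into out (which
 * must have room for n entries), sorted by start offset.  Returns how
 * many extents need to be logged.
 */
size_t collect_modified(const struct toy_extent *all, size_t n,
			uint64_t cur_transid, struct toy_extent *out)
{
	size_t m = 0;

	assert((all != NULL && out != NULL) || n == 0);

	for (size_t i = 0; i < n; i++)
		if (all[i].generation == cur_transid)
			out[m++] = all[i];

	if (m > 1)
		qsort(out, m, sizeof(*out), cmp_start);
	return m;
}
```

For a heavily fragmented file where only a few extents changed in the running transaction, the amount of work per fsync then scales with the number of modified extents rather than with the total extent count.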
    • Josef Bacik's avatar
      Btrfs: fix possible corruption when fsyncing written prealloced extents · 224ecce5
      Josef Bacik authored
      While working on my fsync patch my fsync tester kept hitting mismatching
      md5sums when I would randomly write to a prealloc'ed region, syncfs() and
      then write to the prealloced region some more and then fsync() and then
      immediately reboot.  This is because the tree logging code will skip writing
      csums for file extents whose generation is less than the current running
      transaction.  When we mark extents as written we haven't been updating their
      generation so they were always being skipped.  This wouldn't happen if you
      were to preallocate and then write in the same transaction, but if you for
      example prealloced a VM you could definitely run into this problem.  This
      patch makes my fsync tester happy again.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      224ecce5