  1. 22 Jan, 2015 17 commits
    • Btrfs: sort raid_map before adding tgtdev stripes · cc7539ed
      Zhao Lei authored
      This avoids a complex calculation of the real stripes during the sort.
      Moreover, we can clean up the code that sorts tgtdev_map, because it
      will already be in order initially.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix a out-of-bound access of raid_map · e34c330d
      Zhao Lei authored
      We add the number of stripes on target devices into bbio->num_stripes
      if we are under device replacement, and we only sort the raid_map of
      those stripes that are not on the target devices. So when we need the
      real raid_map, we need to skip the stripes on the target devices.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs · df8d116f
      Filipe Manana authored
      If we have an inode with a large number of hard links, some of which may
      be extrefs, turn a regular ref into an extref, fsync the inode and then
      replay the fsync log (after a crash/reboot), we can end up with an fsync
      log that makes the replay code always fail with -EOVERFLOW when processing
      the inode's references.
      
      This is easy to reproduce with the test case I made for xfstests. Its steps
      are the following:
      
         _scratch_mkfs "-O extref" >> $seqres.full 2>&1
         _init_flakey
         _mount_flakey
      
         # Create a test file with 3001 hard links. This number is large enough to
         # make btrfs start using extrefs at some point even if the fs has the maximum
         # possible leaf/node size (64Kb).
         echo "hello world" > $SCRATCH_MNT/foo
         for i in `seq 1 3000`; do
             ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
         done
      
         # Make sure all metadata and data are durably persisted.
         sync
      
         # Now remove one link, add a new one with a new name, add another new one with
         # the same name as the one we just removed and fsync the inode.
         rm -f $SCRATCH_MNT/foo_link_0001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
         rm -f $SCRATCH_MNT/foo_link_0002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
         # Simulate a crash/power loss. This makes sure the next mount
         # will see an fsync log and will replay that log.
      
         _load_flakey_table $FLAKEY_DROP_WRITES
         _unmount_flakey
      
         _load_flakey_table $FLAKEY_ALLOW_WRITES
         _mount_flakey
      
         # Check that the number of hard links is correct, we are able to remove all
         # the hard links and read the file's data. This is just to verify we don't
         # get stale file handle errors (due to dangling directory index entries that
         # point to inodes that no longer exist).
         echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
         [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
         for ((i = 1; i <= 3003; i++)); do
             name=foo_link_`printf "%04d" $i`
             if [ $i -eq 2 ]; then
                 [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
             else
                 [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
             fi
         done
         rm -f $SCRATCH_MNT/foo_link_*
         cat $SCRATCH_MNT/foo
         rm -f $SCRATCH_MNT/foo
      
         status=0
         exit
      
      The fix is simply to correct the overflow condition when overwriting a
      reference item because it was wrong, trying to increase the item in the
      fs/subvol tree by an impossible amount. Also ensure that we don't insert
      one normal ref and one ext ref for the same dentry - this happened because
      processing a dir index entry from the parent in the log happened when
      the normal ref item was full, which made the logic insert an extref and
      later when the normal ref had enough room, it would be inserted again
      when processing the ref item from the child inode in the log.
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon. This test only passes if the previous
      patch titled "Btrfs: fix fsync when extend references are added to an inode"
      is applied too.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
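The link bookkeeping in the reproducer above can be checked in isolation. The following is a minimal sketch, assuming a POSIX shell and GNU coreutils `stat`. It runs the same unlink/link sequence in a scratch directory created with `mktemp`, scaled down to a hypothetical `n=30` initial links instead of 3000. It does not use xfstests, dm-flakey or log replay, so it only shows which link count and which names a correct replay has to preserve. The helper `link_name` and all variable names are made up for this illustration.

```shell
#!/bin/sh
# Sketch only: no crash simulation and no btrfs-specific behaviour.
n=30                      # hypothetical scale; the real test uses 3000
dir=$(mktemp -d)

# Zero-padded link name, matching the foo_link_NNNN pattern of the test.
link_name() {
    printf '%s/foo_link_%04d' "$dir" "$1"
}

echo "hello world" > "$dir/foo"
i=1
while [ "$i" -le "$n" ]; do
    ln "$dir/foo" "$(link_name "$i")"
    i=$((i + 1))
done

# Same sequence as the test: two unlinks and four new links.
rm -f "$(link_name 1)"
ln "$dir/foo" "$(link_name $((n + 1)))"
ln "$dir/foo" "$(link_name 1)"
rm -f "$(link_name 2)"
ln "$dir/foo" "$(link_name $((n + 2)))"
ln "$dir/foo" "$(link_name $((n + 3)))"

# foo itself + n links - 2 removed + 4 added = n + 3.
expected=$((n + 3))
actual=$(stat --format=%h "$dir/foo")
if [ -e "$(link_name 2)" ]; then link2=present; else link2=absent; fi

echo "Link count: $actual (expected $expected), foo_link_0002 is $link2"
rm -rf "$dir"
```

With `n=3000` the expected count is 3003, which is why the verification loop in the test walks names 1 through 3003 and treats only `foo_link_0002` as a name that must be absent.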
    • Btrfs: fix fsync when extend references are added to an inode · 2c2c452b
      Filipe Manana authored
      If we added an extended reference to an inode and fsync'ed it, the log
      replay code would make our inode have an incorrect link count, which
      was lower than the expected/correct count.
      This resulted in stale directory index entries after deleting some of
      the hard links, and any access to the dangling directory entries resulted
      in -ESTALE errors because the entries pointed to inode items that don't
      exist anymore.
      
      This is easy to reproduce with the test case I made for xfstests, and
      the bulk of that test is:
      
          _scratch_mkfs "-O extref" >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          # Create a test file with 3001 hard links. This number is large enough to
          # make btrfs start using extrefs at some point even if the fs has the maximum
          # possible leaf/node size (64Kb).
          echo "hello world" > $SCRATCH_MNT/foo
          for i in `seq 1 3000`; do
              ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
          done
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Add one more link to the inode that ends up being a btrfs extref and fsync
          # the inode.
          ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Now after the fsync log replay btrfs left our inode with a wrong link count N,
          # which was smaller than the correct link count M (N < M).
          # So after removing N hard links, the remaining M - N directory entries were
          # still visible to user space but it was impossible to do anything with them
          # because they pointed to an inode that didn't exist anymore. This resulted in
          # stale file handle errors (-ESTALE) when accessing those dentries for example.
          #
          # So remove all hard links except the first one and then attempt to read the
          # file, to verify we don't get an -ESTALE error when accessing the inode.
          #
          # The btrfs fsck tool also detected the incorrect inode link count and it
          # reported an error message like the following:
          #
          # root 5 inode 257 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
          #
          # The fstests framework automatically calls fsck after a test is run, so we
          # don't need to call fsck explicitly here.
      
          rm -f $SCRATCH_MNT/foo_link_*
          cat $SCRATCH_MNT/foo
      
          status=0
          exit
      
      So make sure an fsync always flushes the delayed inode item, so that the
      fsync log contains it (needed in order to trigger the link count fixup
      code) and fix the extref counting function, which always returned -ENOENT
      to its caller (and made it assume there were always 0 extrefs).
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix directory inconsistency after fsync log replay · d36808e0
      Filipe Manana authored
      If we have an inode (file) with a link count greater than 1, remove
      one of its hard links, fsync the inode, power fail/crash and then
      replay the fsync log on the next mount, we end up getting the parent
      directory's metadata inconsistent - its i_size still reflects the
      deleted hard link and has dangling index entries (with no matching
      inode reference entries). This prevents the directory from ever being
      deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
      even if all of its children inodes are deleted, and the dangling index
      entries can never be removed (as they point to an inode that does not
      exist anymore).
      
      This is easy to reproduce with the following excerpt from the test case
      for xfstests that I just made:
      
          _scratch_mkfs >> $seqres.full 2>&1
      
          _init_flakey
          _mount_flakey
      
          # Create a test file with 2 hard links in the same directory.
          mkdir -p $SCRATCH_MNT/a/b
          echo "hello world" > $SCRATCH_MNT/a/b/foo
          ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Now remove one of the hard links and fsync the inode.
          rm -f $SCRATCH_MNT/a/b/bar
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Remove the last hard link of the file and attempt to remove its parent
          # directory - this failed in btrfs because the fsync log and replay code
          # didn't decrement the parent directory's i_size and left dangling directory
          # index entries - this made the btrfs rmdir implementation always fail with
          # the error -ENOTEMPTY.
          #
          # The dangling directory index entries were visible to user space, but it was
          # impossible to do anything on them (unlink, open, read, write, stat, etc)
          # because the inode they pointed to did not exist anymore.
          #
          # The parent directory's metadata inconsistency (stale index entries) was
          # also detected by btrfs' fsck tool, which is run automatically by the fstests
          # framework when the test finishes. The error message reported by fsck was:
          #
          # root 5 inode 259 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
          #
          rm -f $SCRATCH_MNT/a/b/*
          rmdir $SCRATCH_MNT/a/b
          rmdir $SCRATCH_MNT/a
      
      To fix this just make sure that after an unlink, if the inode is fsync'ed,
      the parent inode is fully logged in the fsync log.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
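The user-visible invariant behind this fix is that a directory becomes removable once all of its entries have been unlinked. The following is a minimal sketch of that check, assuming a POSIX shell and a scratch directory from `mktemp`. It leaves out the fsync, the simulated crash and the log replay, so on a healthy filesystem it simply succeeds. The variable names are hypothetical.

```shell
#!/bin/sh
# Sketch only: shows the expected end state, not the btrfs bug itself.
dir=$(mktemp -d)
mkdir -p "$dir/a/b"
echo "hello world" > "$dir/a/b/foo"
ln "$dir/a/b/foo" "$dir/a/b/bar"

# Remove one hard link. In the real test the fsync, the simulated
# crash and the log replay happen right after this step.
rm -f "$dir/a/b/bar"

# Remove the last hard link, then the now-empty directories.
rm -f "$dir/a/b/"*
if rmdir "$dir/a/b" && rmdir "$dir/a"; then
    result=removed
else
    # rmdir fails with ENOTEMPTY when any entry is left behind.
    result=stuck
fi

echo "Directory tree: $result"
rm -rf "$dir"
```

On an affected kernel the same final `rmdir` calls, run after the log replay, would take the failure branch, because the stale index entry for `bar` keeps the parent directory non-empty.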
    • Btrfs: lookup for block group only if needed when freeing a tree block · 6219872d
      Filipe Manana authored
      Very often our extent buffer's header generation doesn't match the current
      transaction's id or it is also referenced by other trees (snapshots), so
      we don't need the corresponding block group cache object. Therefore only
      search for it if we are going to use it, so we avoid an unnecessary search
      in the block groups rbtree (and acquiring and releasing its spinlock).
      
      Freeing a tree block is performed when COWing or deleting a node/leaf,
      which implies we are holding the node/leaf's parent node lock, therefore
      reducing the amount of time spent when freeing a tree block helps reduce
      the amount of time we are holding the parent node's lock.
      
      For example, for a run of xfstests/generic/083, the block group cache
      object was needed only 682 times for a total of 226691 calls to free
      a tree block.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
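To put the figures from the generic/083 run in proportion, the arithmetic can be spelled out. This is a minimal sketch, assuming a POSIX shell with `awk` available. The helper `percent` is a made-up name for this illustration, and the two numbers are the ones quoted in the commit message.

```shell
#!/bin/sh
# percent PART TOTAL: print PART as a percentage of TOTAL, two decimals.
percent() {
    awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.2f\n", part * 100 / total }'
}

needed=682        # calls that actually used the block group cache object
calls=226691      # total calls to free a tree block

echo "lookup needed in $(percent "$needed" "$calls")% of calls"
echo "lookups avoided: $((calls - needed))"
```

In that run the rbtree search and its spinlock round trip were needed for well under one percent of the calls, which is the case for doing the lookup lazily.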
    • btrfs: remove a no-op unfreeze superbock callback · 730a78c7
      David Sterba authored
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: switch extent_state state to unsigned · 9ee49a04
      David Sterba authored
      Currently there's a 4B hole in the structure between refs and state, and
      only 16 bits of state are used, so we can make it unsigned. This gives
      better packing and may save some stack space for local variables.
      
      The size of extent_state gets reduced by 8B and there are usually a lot
      of slab objects.
      
      struct extent_state {
      	u64                        start;                /*     0     8 */
      	u64                        end;                  /*     8     8 */
      	struct rb_node             rb_node;              /*    16    24 */
      	wait_queue_head_t          wq;                   /*    40    24 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	atomic_t                   refs;                 /*    64     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          state;                /*    72     8 */
      	u64                        private;              /*    80     8 */
      
      	/* size: 88, cachelines: 2, members: 7 */
      	/* sum members: 84, holes: 1, sum holes: 4 */
      	/* last cacheline: 24 bytes */
      };
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
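The 8B saving can be reproduced from the layout dump in the commit message with a little alignment arithmetic. This is a minimal sketch, assuming a POSIX shell. The helper `struct_size` is a made-up name, the member sizes are taken from the dump (a 64-bit build), the alignments are assumed to be 8 bytes for every member except the 4-byte `refs`, and the new `state` is assumed to be a 4-byte `unsigned` with 4-byte alignment.

```shell
#!/bin/sh
# struct_size SIZE:ALIGN [SIZE:ALIGN ...]
# Lay the members out in order, padding each one up to its alignment,
# then pad the tail to the largest member alignment.
struct_size() {
    offset=0
    max_align=1
    for member in "$@"; do
        size=${member%%:*}
        align=${member##*:}
        # Round the current offset up to this member's alignment.
        offset=$(( (offset + align - 1) / align * align ))
        offset=$(( offset + size ))
        if [ "$align" -gt "$max_align" ]; then
            max_align=$align
        fi
    done
    echo $(( (offset + max_align - 1) / max_align * max_align ))
}

# start, end, rb_node, wq, refs, state, private
before=$(struct_size 8:8 8:8 24:8 24:8 4:4 8:8 8:8)
after=$(struct_size 8:8 8:8 24:8 24:8 4:4 4:4 8:8)

echo "before: $before bytes, after: $after bytes, saved: $((before - after))"
```

With an 8-byte `state`, the member has to start at offset 72, which leaves the 4-byte hole after `refs`. A 4-byte `state` fits into that hole at offset 68, so `private` moves up to offset 72 and the structure ends at 80 bytes instead of 88.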
    • btrfs: set proper message level for skinny metadata · 5efa0490
      David Sterba authored
      This has been confusing people for too long; the message is really just
      informative.
      
      CC: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: update message levels after checksum errors · f0954c66
      David Sterba authored
      The errors are worth noting and might get missed with INFO level.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: update message levels during failed mount · aa8ee312
      David Sterba authored
      All error conditions from open_ctree shall be ERR. Warning would
      suggest that something's wrong and we can continue.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: update message levels for errors · 68b663d1
      David Sterba authored
      Several messages point to some internal problem; level INFO is
      wrong here.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix setup_leaf_for_split() to avoid leaf corruption · a8df6fe6
      Filipe Manana authored
      We were incorrectly detecting when the target key didn't exist anymore
      after releasing the path and re-searching the tree. This could make
      us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
      are its only callers at the moment) an item when we should not.
      
      For the case of duplicating an item, we currently only duplicate
      checksum items (csum tree) and file extent items (fs/subvol trees).
      For the checksum items we end up overriding the item completely,
      but for file extent items we update only some of their fields in
      the copy (done in __btrfs_drop_extents), which means we can end up
      having a logical corruption for some values.
      
      Also for the case where we duplicate a file extent item it will make
      us produce a leaf with a wrong key order, as btrfs_duplicate_item()
      advances us to the next slot and then its caller sets a smaller key
      on the new item at that slot (like in __btrfs_drop_extents() e.g.).
      Alternatively if the tree search in setup_leaf_for_split() leaves
      with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
      up accessing beyond the leaf's end (when we check if the item's size
      has changed) and make our caller insert an item at the invalid slot
      btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
      access if the leaf is full or nearly full.
      
      This issue has been present since the introduction of this function
      in 2009:
      
          Btrfs: Add btrfs_duplicate_item
          commit ad48fd75
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Merge branch 'cleanup/blocksize-diet-part2' of... · 57bbddd7
      Chris Mason authored
      Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
    • Merge branch 'fix/find-item-path-leak' of... · d3541834
      Chris Mason authored
      Merge branch 'fix/find-item-path-leak' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
    • Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik authored
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list that a block group gets added to
      when we dirty it for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: change how we track dirty roots · e7070be1
      Josef Bacik authored
      I've been overloading root->dirty_list to keep track of dirty roots and which
      roots need to have their commit roots switched at transaction commit time.  This
      could cause us to lose an update to the root which could corrupt the file
      system.  To fix this use a state bit to know if the root is dirty, and if it
      isn't set we go ahead and move the root to the dirty list.  This way if we
      re-dirty the root after adding it to the switch_commit list we make sure to
      update it.  This also makes it so that the extent root is always the last root
      on the dirty list to try and keep the amount of churn down at this point in the
      commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 18 Jan, 2015 4 commits
    • Linux 3.19-rc5 · ec6f34e5
      Linus Torvalds authored
    • Merge tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · d0ac5d8e
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "We've been sitting on our fixes branch for a while, so this batch is
        unfortunately on the large side.
      
        A lot of these are tweaks and fixes to device trees, fixing various
        bugs around clocks, reg ranges, etc.  There's also a few defconfig
        updates (which are on the late side, no more of those).
      
        All in all the diffstat is bigger than ideal at this time, but nothing
        in here seems particularly risky"
      
      * tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits)
        reset: sunxi: fix spinlock initialization
        ARM: dts: disable CCI on exynos5420 based arndale-octa
        drivers: bus: check cci device tree node status
        ARM: rockchip: disable jtag/sdmmc autoswitching on rk3288
        ARM: nomadik: fix up leftover device tree pins
        ARM: at91: board-dt-sama5: add phy_fixup to override NAND_Tree
        ARM: at91/dt: sam9263: Add missing clocks to lcdc node
        ARM: at91: sama5d3: dt: correct the sound route
        ARM: at91/dt: sama5d4: fix the timer reg length
        ARM: exynos_defconfig: Enable LM90 driver
        ARM: exynos_defconfig: Enable options for display panel support
        arm: dts: Use pmu_system_controller phandle for dp phy
        ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances
        ARM: dts: berlin: correct BG2Q's SM GPIO location.
        ARM: dts: berlin: add broken-cd and set bus width for eMMC in Marvell DMP DT
        ARM: dts: berlin: fix io clk and add missing core clk for BG2Q sdhci2 host
        ARM: dts: Revert disabling of smc91x for n900
        ARM: dts: imx51-babbage: Fix ULPI PHY reset modelling
        ARM: dts: dra7-evm: fix qspi device tree partition size
        ARM: omap2plus_defconfig: use CONFIG_CPUFREQ_DT
        ...
    • Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux · 12ba8571
      Linus Torvalds authored
      Pull clock driver fixes from Mike Turquette:
       "Small number of fixes for clock drivers and a single null pointer
        dereference fix in the framework core code.
      
        The driver fixes vary from fixing section mismatch warnings to
        preventing machines from hanging (and preventing developers from
        crying)"
      
      * tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux:
        clk: fix possible null pointer dereference
        Revert "clk: ppc-corenet: Fix Section mismatch warning"
        clk: rockchip: fix deadlock possibility in cpuclk
        clk: berlin: bg2q: remove non-exist "smemc" gate clock
        clk: at91: keep slow clk enabled to prevent system hang
        clk: rockchip: fix rk3288 cpuclk core dividers
        clk: rockchip: fix rk3066 pll lock bit location
        clk: rockchip: Fix clock gate for rk3188 hclk_emem_peri
        clk: rockchip: add CLK_IGNORE_UNUSED flag to fix rk3066/rk3188 USB Host
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 901b2082
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This is one fix for a Multiqueue sleeping in invalid context problem
        and a MAINTAINER file update for Qlogic"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: ->queue_rq can't sleep
        MAINTAINERS: Update maintainer list for qla4xxx
  3. 17 Jan, 2015 18 commits
  4. 16 Jan, 2015 1 commit