1. 27 Jun, 2016 6 commits
    • gfs2: writeout truncated pages · fd4c5748
      Benjamin Marzinski authored
      When gfs2 attempts to write a page to a file that is being truncated,
      and notices that the page is completely outside of the file size, it
      tries to invalidate it.  However, this may require a transaction for
      journaled data files to revoke any buffers from the page on the active
      items list. Unfortunately, this can happen inside a log flush, where a
      transaction cannot be started. Also, gfs2 may need to be able to remove
      the buffer from the ail1 list before it can finish the log flush.
      
      To deal with this, when writing a page of a file with data journaling
      enabled, gfs2 now skips the check to see if the write is outside the file
      size, and simply writes it anyway. This situation can only occur when
      the truncate code still has the file locked exclusively, and hasn't
      marked this block as free in the metadata (which happens later in
      trunc_dealloc).  After gfs2 writes this page out, the truncation code
      will shortly invalidate it and write out any revokes if necessary.
      
      To do this, gfs2 now implements its own version of block_write_full_page
      without the check, and calls the newly exported __block_write_full_page.
      It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • fs: export __block_write_full_page · b4bba389
      Benjamin Marzinski authored
      gfs2 needs to be able to skip the check to see if a page is outside of
      the file size when writing it out. gfs2 can get into a situation where
      it needs to flush its in-memory log to disk while a truncate is in
      progress. If the file being truncated has data journaling enabled, it is
      possible that there are data blocks in the log that are past the end of
      the file. gfs2 can't finish the log flush without either writing these
      blocks out or revoking them. Otherwise, if the node crashed, it could
      overwrite subsequent changes made by other nodes in the cluster when
      its journal was replayed.
      
      Unfortunately, there is no way to add log entries to the log during a
      flush. So gfs2 simply writes out the page instead. This situation can
      only occur when the truncate code still has the file locked exclusively,
      and hasn't marked this block as free in the metadata (which happens
      later in trunc_dealloc).  After gfs2 writes this page out, the truncation
      code will shortly invalidate it and write out any revokes if necessary.
      
      In order to make this work, gfs2 needs to be able to skip the check for
      writes outside the file size. Since the check exists in
      block_write_full_page, this patch exports __block_write_full_page, which
      doesn't have the check.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
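
      The size check that gfs2 needs to bypass can be modelled in user
      space. The following is a minimal sketch, not the kernel code:
      PAGE_SIZE_MODEL and page_fully_outside_i_size are hypothetical names,
      and a 4096-byte page is assumed. It only illustrates when a page
      counts as lying completely beyond the file size.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Assumed page size for this model; the kernel uses PAGE_SIZE. */
      #define PAGE_SIZE_MODEL 4096u

      /*
       * Hypothetical helper: returns true when the page at page_index holds
       * no byte below i_size, i.e. it is completely outside the file.
       */
      static bool page_fully_outside_i_size(uint64_t page_index, uint64_t i_size)
      {
          /* Index of the first page that is not entirely inside the file. */
          uint64_t end_index = i_size / PAGE_SIZE_MODEL;
          /* Number of file bytes that fall into that page. */
          uint64_t offset = i_size % PAGE_SIZE_MODEL;

          if (page_index < end_index)
              return false;   /* page is entirely within the file */
          if (page_index > end_index)
              return true;    /* page starts past the end of the file */
          return offset == 0; /* boundary page: outside only if it holds no data */
      }
      ```

      In this model, a writer that skips the check simply never calls the
      helper and writes the page regardless of its position, which is the
      behaviour the commit message describes for journaled data files.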
    • gfs2: Lock holder cleanup · 6df9f9a2
      Andreas Gruenbacher authored
      Make the code more readable by cleaning up the different ways of
      initializing lock holders and checking for initialized lock holders:
      mark lock holders as uninitialized by setting the holder's glock to NULL
      (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
      object or using a separate flag.  Recognize initialized holders by their
      non-NULL glock (gfs2_holder_initialized).  Don't zero out holder objects
      which are immediately initialized via gfs2_holder_init or
      gfs2_glock_nq_init.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
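
      The convention described above can be sketched in a few lines of
      user-space C. The struct layouts and the _model names below are
      assumptions made for illustration; only the idea (a NULL glock
      pointer means "uninitialized") is taken from the commit message.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      /* Stand-in for a glock; the real structure is far larger. */
      struct glock_model {
          int id;
      };

      /* Stand-in for a lock holder; gh_gl is an assumed field name. */
      struct holder_model {
          struct glock_model *gh_gl;  /* NULL means "not initialized" */
          int gh_state;
      };

      /* Mark a holder as uninitialized without zeroing the whole object. */
      static void holder_mark_uninitialized(struct holder_model *gh)
      {
          gh->gh_gl = NULL;
      }

      /* A holder is initialized exactly when it points at a glock. */
      static bool holder_initialized(const struct holder_model *gh)
      {
          return gh->gh_gl != NULL;
      }

      /* Initialization overwrites every field, so no prior memset is needed. */
      static void holder_init(struct glock_model *gl, int state,
                              struct holder_model *gh)
      {
          gh->gh_gl = gl;
          gh->gh_state = state;
      }
      ```

      Because holder_init assigns every field, zeroing the object just
      before calling it would be redundant, which is the point of the
      last sentence of the commit message.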
    • gfs2: Large-filesystem fix for 32-bit systems · cda9dd42
      Andreas Gruenbacher authored
      Commit ff34245d switched from iget5_locked to iget_locked among other
      things, but iget_locked doesn't work for filesystems larger than 2^32
      blocks on 32-bit systems.  Switch back to iget5_locked.  Filesystems
      larger than 2^32 blocks are unlikely to work well on 32-bit systems,
      so this is mostly a code cleanliness fix.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
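
      The underlying problem is plain integer truncation: iget_locked
      looks inodes up by an unsigned long inode number, which is 32 bits
      wide on 32-bit systems, whereas iget5_locked lets the caller supply
      its own comparison. The sketch below models this in user space;
      ino32_t, ino_from_block and same_inode are hypothetical names, not
      kernel functions.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Models unsigned long as an inode number on a 32-bit system. */
      typedef uint32_t ino32_t;

      /* Deriving the lookup key from a 64-bit block address drops the high bits. */
      static ino32_t ino_from_block(uint64_t block_addr)
      {
          return (ino32_t)block_addr;
      }

      /*
       * The kind of comparison a caller-supplied test callback can make:
       * it compares the full 64-bit block address rather than the truncated key.
       */
      static bool same_inode(uint64_t cached_addr, uint64_t wanted_addr)
      {
          return cached_addr == wanted_addr;
      }
      ```

      Two blocks whose addresses differ by a multiple of 2^32 produce the
      same 32-bit key here, so a lookup keyed on that value alone cannot
      tell them apart, while the full comparison can.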
    • gfs2: Get rid of gfs2_ilookup · ec5ec66b
      Andreas Gruenbacher authored
      Now that gfs2_lookup_by_inum only takes the inode glock for new inodes
      (and not for cached inodes anymore), there no longer is a need to
      optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
      and gfs2_ilookup can be removed.
      
      In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
      i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
      inconsistency goes away as well.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
    • gfs2: Fix gfs2_lookup_by_inum lock inversion · 3ce37b2c
      Andreas Gruenbacher authored
      The current gfs2_lookup_by_inum takes the glock of a presumed inode
      identified by block number, verifies that the block is indeed an inode,
      and then instantiates and reads the new inode via gfs2_inode_lookup.
      
      However, instantiating a new inode may block on freeing a previous
      instance of that inode (__wait_on_freeing_inode), and freeing an inode
      requires taking the glock that is already held, leading to lock
      inversion and deadlock.
      
      Fix this by first instantiating the new inode, then verifying that the
      block is an inode (if required), and then reading in the new inode, all
      in gfs2_inode_lookup.
      
      If the block we are looking for is not an inode, we discard the new
      inode via iget_failed, which marks inodes as bad and unhashes them.
      Other tasks waiting on that inode will get a bad inode back from
      ilookup or iget_locked; in that case, retry the lookup.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  2. 17 Jun, 2016 1 commit
  3. 10 Jun, 2016 1 commit
    • GFS2: don't set rgrp gl_object until it's inserted into rgrp tree · 36e4ad03
      Bob Peterson authored
      Before this patch, function read_rindex_entry would set a rgrp
      glock's gl_object pointer to itself before inserting the rgrp into
      the rgrp rbtree. The problem is: if another process was also reading
      the rgrp in, and had already inserted its newly created rgrp, then
      the second call to read_rindex_entry would overwrite that value,
      then return an error code to the caller. Later, other functions
      would reference the now-freed rgrp memory by way of gl_object.
      In some cases, that could result in gfs2_rgrp_brelse being called
      twice for the same rgrp: once for the failed attempt and once for
      the "real" rgrp release. Eventually the kernel would panic.
      There are also a number of other things that could go wrong when
      a kernel module is accessing freed storage. For example, this could
      result in rgrp corruption because the fake rgrp would point to a
      fake bitmap in memory too, causing gfs2_inplace_reserve to search
      some random memory for free blocks, and find some, since we were
      never setting rgd->rd_bits to NULL before freeing it.
      
      This patch fixes the problem by not setting gl_object until we
      have successfully inserted the rgrp into the rbtree. Also, it sets
      rd_bits to NULL as it frees them, which will ensure any accidental
      access to the wrong rgrp will result in a kernel panic rather than
      file system corruption, which is preferred.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
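
      The ordering fix can be illustrated with a small user-space model:
      publish the back-pointer only after the insert has succeeded. This
      is a sketch under simplifying assumptions, not the gfs2 code. A
      fixed array stands in for the rbtree, locking is omitted, and every
      name ending in _model, as well as index_insert and read_entry, is
      hypothetical.

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      #define MAX_RGRPS 8

      /* Stand-in for a resource group, identified by its block address. */
      struct rgrp_model {
          uint64_t addr;
      };

      /* Stand-in for the glock, which points back at its resource group. */
      struct rg_glock_model {
          struct rgrp_model *gl_object;
      };

      /* Stand-in for the rgrp rbtree: a small array with unique addresses. */
      struct rgrp_index_model {
          struct rgrp_model *slots[MAX_RGRPS];
          size_t count;
      };

      /* Insert rgd unless an entry with the same address already exists. */
      static bool index_insert(struct rgrp_index_model *idx, struct rgrp_model *rgd)
      {
          for (size_t i = 0; i < idx->count; i++) {
              if (idx->slots[i]->addr == rgd->addr)
                  return false;  /* someone else inserted it first */
          }
          if (idx->count == MAX_RGRPS)
              return false;      /* model limitation: index is full */
          idx->slots[idx->count++] = rgd;
          return true;
      }

      /* Set gl_object only once the insert has succeeded. */
      static bool read_entry(struct rgrp_index_model *idx,
                             struct rg_glock_model *gl, struct rgrp_model *rgd)
      {
          if (!index_insert(idx, rgd))
              return false;      /* gl->gl_object is left untouched */
          gl->gl_object = rgd;
          return true;
      }
      ```

      With this ordering, a caller that loses the race can discard its own
      rgd without the glock ever having pointed at it, so gl_object keeps
      referring to the entry that is actually in the index.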
  4. 24 May, 2016 32 commits