Commits · 70261f568f3c08552f034742e3d5cb78c3877766 · Kirill Smelkov / linux

28 Aug, 2013 3 commits

ext4: Fix misspellings using 'codespell' tool · 70261f56

Anatol Pomozov authored Aug 28, 2013

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

70261f56

ext4: convert write_begin methods to stable_page_writes semantics · 7afe5aa5

Dmitry Monakhov authored Aug 28, 2013

Use wait_for_stable_page() instead of wait_on_page_writeback()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

7afe5aa5

ext4: fix use of potentially uninitialized variables in debugging code · 27b1b228

Andi Shyti authored Aug 28, 2013

If ext_debugging is enabled and path[depth].p_ext is NULL, len
and lblock are printed non initialized
Signed-off-by: Andi Shyti <andi@etezian.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

27b1b228

17 Aug, 2013 15 commits

ext4: fix lost truncate due to race with writeback · 90e775b7

Jan Kara authored Aug 17, 2013

The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr()				mpage_map_and_submit_extent()
  EXT4_I(inode)->i_disksize = attr->ia_size;
  ...					  ...
					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
					  /* False because i_size isn't
					   * updated yet */
					  if (disksize > i_size_read(inode))
					  /* True, because i_disksize is
					   * already truncated */
					  if (disksize > EXT4_I(inode)->i_disksize)
					    /* Overwrite i_disksize
					     * update from truncate */
					    ext4_update_i_disksize()
  i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

90e775b7

ext4: simplify truncation code in ext4_setattr() · 5208386c

Jan Kara authored Aug 17, 2013

Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

5208386c

ext4: fix ext4_writepages() in presence of truncate · 5f1132b2

Jan Kara authored Aug 17, 2013

Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:

ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks

Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.

Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

5f1132b2

ext4: move test whether extent to map can be extended to one place · 09930042

Jan Kara authored Aug 17, 2013

Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

09930042

ext4: fix warning in ext4_da_update_reserve_space() · 7d734532

Jan Kara authored Aug 17, 2013

reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():

EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)

The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.

Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.
Signed-off-by: Jan Kara <jack@suse.cz>

7d734532

quota: provide interface for readding allocated space into reserved space · 1c8924eb

Jan Kara authored Aug 17, 2013

ext4 needs to convert allocated (metadata) blocks back into blocks
reserved for delayed allocation. Add functions into quota code for
supporting such operation.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

1c8924eb

ext4: avoid reusing recently deleted inodes in no journal mode · 19883bd9

Theodore Ts'o authored Aug 16, 2013

In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away.  Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk.  However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.

E2fsck will make sure the file system is consistent after the
unclean shutdown.  However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues.  We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.

Google-Bug-Id: 10017573
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

19883bd9

ext4: allocate delayed allocation blocks before rename · 0e202704

Theodore Ts'o authored Aug 16, 2013

When ext4_rename() overwrites an already existing file, call
ext4_alloc_da_blocks() before starting the journal handle which
actually does the rename, instead of doing this afterwards.  This
improves the likelihood that the contents will survive a crash if an
application replaces a file using the sequence:

1)  write replacement contents to foo.new
2)  <omit fsync of foo.new>
3)  rename foo.new to foo

It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
doing a file integrity sync; this means if foo.new is a very large
file, it may not be completely flushed out to disk.

However, for files smaller than a megabyte or so, any dirty pages
should be flushed out before we do the rename operation, and so at the
next journal commit, the CACHE FLUSH command will make sure al of
these pages are safely on the disk platter.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

0e202704

ext4: start handle at least possible moment when renaming files · 5b61de75

Theodore Ts'o authored Aug 16, 2013

In ext4_rename(), don't start the journal handle until the the
directory entries have been successfully looked up.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

5b61de75

ext4: add support for extent pre-caching · 7869a4a6

Theodore Ts'o authored Aug 16, 2013

Add a new fiemap flag which forces the all of the extents in an inode
to be cached in the extent_status tree. This is critically important
when using AIO to a preallocated file, since if we need to read in
blocks from the extent tree, the io_submit(2) system call becomes
synchronous, and the AIO is no longer "A", which is bad.

In addition, for most files which have an external leaf tree block,
the cost of caching the information in the extent status tree will be
less than caching the entire 4k block in the buffer cache. So it is
generally a win to keep the extent information cached.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

7869a4a6

ext4: cache all of an extent tree's leaf block upon reading · 107a7bd3

Theodore Ts'o authored Aug 16, 2013

When we read in an extent tree leaf block from disk, arrange to have
all of its entries cached.  In nearly all cases the in-memory
representation will be more compact than the on-disk representation in
the buffer cache, and it allows us to get the information without
having to traverse the extent tree for successive extents.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

107a7bd3

ext4: use unsigned int for es_status values · 3be78c73

Theodore Ts'o authored Aug 16, 2013

Don't use an unsigned long long for the es_status flags; this requires
that we pass 64-bit values around which is painful on 32-bit systems.
Instead pass the extent status flags around using the low 4 bits of an
unsigned int, and shift them into place when we are reading or writing
es_pblk.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

3be78c73

ext4: print the block number of invalid extent tree blocks · c349179b

Theodore Ts'o authored Aug 16, 2013

When we find an invalid extent tree block, report the block number of
the bad block for debugging purposes.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

c349179b

ext4: refactor code to read the extent tree block · 7d7ea89e

Theodore Ts'o authored Aug 16, 2013

Refactor out the code needed to read the extent tree block into a
single read_extent_tree_block() function.  In addition to simplifying
the code, it also makes sure that we call the ext4_ext_load_extent
tracepoint whenever we need to read an extent tree block from disk.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

7d7ea89e

jbd2: Fix oops in jbd2_journal_file_inode() · a361293f

Jan Kara authored Aug 16, 2013

Commit 0713ed0c added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.

We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

a361293f

12 Aug, 2013 2 commits

jbd2: Fix use after free after error in jbd2_journal_dirty_metadata() · 91aa11fa

Jan Kara authored Aug 12, 2013

When jbd2_journal_dirty_metadata() returns error,
__ext4_handle_dirty_metadata() stops the handle. However callers of this
function do not count with that fact and still happily used now freed
handle. This use after free can result in various issues but very likely
we oops soon.

The motivation of adding __ext4_journal_stop() into
__ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
improve error reporting. So replace __ext4_journal_stop() with
ext4_journal_abort_handle() which was there before that commit and add
WARN_ON_ONCE() to dump stack to provide useful information.
Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org	# 3.2+

91aa11fa

ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT · cde2d7a7

Theodore Ts'o authored Aug 12, 2013

Previously we weren't swapping only some of the extent_status LRU
fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
much safer thing to do is to just completely flush the extent status
tree when doing the swap.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: stable@vger.kernel.org

cde2d7a7

09 Aug, 2013 2 commits

ext4: fix mount/remount error messages for incompatible mount options · 6ae6514b

Piotr Sarna authored Aug 08, 2013

Commit 56889787 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored.  Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.
Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

6ae6514b

ext4: allow the mount options nodelalloc and data=journal · 59d9fa5c

Theodore Ts'o authored Aug 08, 2013

Commit 26092bf5 ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously.  This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.
Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

59d9fa5c

29 Jul, 2013 2 commits

ext4: add WARN_ON to check the length of allocated blocks · 44fb851d

Zheng Liu authored Jul 29, 2013

In commit 921f266b: ext4: add self-testing infrastructure to do a
sanity check, some sanity checks were added in map_blocks to make sure
'retval == map->m_len'.

Enable these checks by default and report any assertion failures using
ext4_warning() and WARN_ON() since they can help us to figure out some
bugs that are otherwise hard to hit.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

44fb851d

ext4: fix retry handling in ext4_ext_truncate() · 94eec0fc

Theodore Ts'o authored Jul 29, 2013

We tested for ENOMEM instead of -ENOMEM.   Oops.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

94eec0fc

26 Jul, 2013 2 commits

ext4: destroy ext4_es_cachep on module unload · dd12ed14

Eric Sandeen authored Jul 26, 2013

Without this, module can't be reloaded.

[  500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.8+

dd12ed14

ext4: make sure group number is bumped after a inode allocation race · a34eb503

Theodore Ts'o authored Jul 26, 2013

When we try to allocate an inode, and there is a race between two
CPU's trying to grab the same inode, _and_ this inode is the last free
inode in the block group, make sure the group number is bumped before
we continue searching the rest of the block groups.  Otherwise, we end
up searching the current block group twice, and we end up skipping
searching the last block group.  So in the unlikely situation where
almost all of the inodes are allocated, it's possible that we will
return ENOSPC even though there might be free inodes in that last
block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org

a34eb503

21 Jul, 2013 5 commits

Linux 3.11-rc2 · 3b2f64d0
Linus Torvalds authored Jul 21, 2013

3b2f64d0

Merge tag 'acpi-video-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ea45ea70

Linus Torvalds authored Jul 21, 2013

Pull ACPI video support fixes from Rafael Wysocki:
 "I'm sending a separate pull request for this as it may be somewhat
  controversial.  The breakage addressed here is not really new and the
  fixes may not satisfy all users of the affected systems, but we've had
  so much back and forth dance in this area over the last several weeks
  that I think it's time to actually make some progress.

  The source of the problem is that about a year ago we started to tell
  BIOSes that we're compatible with Windows 8, which we really need to
  do, because some systems shipping with Windows 8 are tested with it
  and nothing else, so if we tell their BIOSes that we aren't compatible
  with Windows 8, we expose our users to untested BIOS/AML code paths.

  However, as it turns out, some Windows 8-specific AML code paths are
  not tested either, because Windows 8 actually doesn't use the ACPI
  methods containing them, so if we declare Windows 8 compatibility and
  attempt to use those ACPI methods, things break.  That occurs mostly
  in the backlight support area where in particular the _BCM and _BQC
  methods are plain unusable on some systems if the OS declares Windows
  8 compatibility.

  [ The additional twist is that they actually become usable if the OS
    says it is not compatible with Windows 8, but that may cause
    problems to show up elsewhere ]

  Investigation carried out by Matthew Garrett indicates that what
  Windows 8 does about backlight is to leave backlight control up to
  individual graphics drivers.  At least there's evidence that it does
  that if the Intel graphics driver is used, so we've decided to follow
  Windows 8 in that respect and allow i915 to control backlight (Daniel
  likes that part).

  The first commit from Aaron Lu makes ACPICA export the variable from
  which we can infer whether or not the BIOS believes that we are
  compatible with Windows 8.

  The second commit from Matthew Garrett prepares the ACPI video driver
  by making it initialize the ACPI backlight even if it is not going to
  be used afterward (that is needed for backlight control to work on
  Thinkpads).

  The third commit implements the actual workaround making i915 take
  over backlight control if the firmware thinks it's dealing with
  Windows 8 and is based on the work of multiple developers, including
  Matthew Garrett, Chun-Yi Lee, Seth Forshee, and Aaron Lu.

  The final commit from Aaron Lu makes us follow Windows 8 by informing
  the firmware through the _DOS method that it should not carry out
  automatic brightness changes, so that brightness can be controlled by
  GUI.

  Hopefully, this approach will allow us to avoid using blacklists of
  systems that should not declare Windows 8 compatibility just to avoid
  backlight control problems in the future.

   - Change from Aaron Lu makes ACPICA export a variable which can be
     used by driver code to determine whether or not the BIOS believes
     that we are compatible with Windows 8.

   - Change from Matthew Garrett makes the ACPI video driver initialize
     the ACPI backlight even if it is not going to be used afterward
     (that is needed for backlight control to work on Thinkpads).

   - Fix from Rafael J Wysocki implements Windows 8 backlight support
     workaround making i915 take over bakclight control if the firmware
     thinks it's dealing with Windows 8.  Based on the work of multiple
     developers including Matthew Garrett, Chun-Yi Lee, Seth Forshee,
     and Aaron Lu.

   - Fix from Aaron Lu makes the kernel follow Windows 8 by informing
     the firmware through the _DOS method that it should not carry out
     automatic brightness changes, so that brightness can be controlled
     by GUI"

* tag 'acpi-video-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / video: no automatic brightness changes by win8-compatible firmware
  ACPI / video / i915: No ACPI backlight if firmware expects Windows 8
  ACPI / video: Always call acpi_video_init_brightness() on init
  ACPICA: expose OSI version

ea45ea70

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 90db76e8

Linus Torvalds authored Jul 20, 2013

Pull ext[34] tmpfile bugfix from Ted Ts'o:
 "Fix regression caused by commit af51a2ac which added ->tmpfile()
  support (along with a similar fix for ext3)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext3: fix a BUG when opening a file with O_TMPFILE flag
  ext4: fix a BUG when opening a file with O_TMPFILE flag

90db76e8

ext3: fix a BUG when opening a file with O_TMPFILE flag · dda5690d

Zheng Liu authored Jul 20, 2013

When we try to open a file with O_TMPFILE flag, we will trigger a bug.
The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
this check always fails because we set ->i_nlink = 1 in
inode_init_always().  We can use the following program to trigger it:

int main(int argc, char *argv[])
{
	int fd;

	fd = open(argv[1], O_TMPFILE, 0666);
	if (fd < 0) {
		perror("open ");
		return -1;
	}
	close(fd);
	return 0;
}

The oops message looks like this:

kernel: kernel BUG at fs/ext3/namei.c:1992!
kernel: invalid opcode: 0000 [#1] SMP
kernel: Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand ipv6 dm_mirror dm_region_hash dm_log dm_mod parport_pc parport serio_raw sg dcdbas pcspkr i2c_i801 ehci_pci ehci_hcd button acpi_cpufreq mperf e1000e ptp pps_core ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core ext3 jbd sd_mod ahci libahci libata scsi_mod uhci_hcd
kernel: CPU: 0 PID: 2882 Comm: tst_tmpfile Not tainted 3.11.0-rc1+ #4
kernel: Hardware name: Dell Inc. OptiPlex 780 /0V4W66, BIOS A05 08/11/2010
kernel: task: ffff880112d30050 ti: ffff8801124d4000 task.ti: ffff8801124d4000
kernel: RIP: 0010:[<ffffffffa00db5ae>] [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3]
kernel: RSP: 0018:ffff8801124d5cc8  EFLAGS: 00010202
kernel: RAX: 0000000000000000 RBX: ffff880111510128 RCX: ffff8801114683a0
kernel: RDX: 0000000000000000 RSI: ffff880111510128 RDI: ffff88010fcf65a8
kernel: RBP: ffff8801124d5d18 R08: 0080000000000000 R09: ffffffffa00d3b7f
kernel: R10: ffff8801114683a0 R11: ffff8801032a2558 R12: 0000000000000000
kernel: R13: ffff88010fcf6800 R14: ffff8801032a2558 R15: ffff8801115100d8
kernel: FS:  00007f5d172b5700(0000) GS:ffff880117c00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
kernel: CR2: 00007f5d16df15d0 CR3: 0000000110b1d000 CR4: 00000000000407f0
kernel: Stack:
kernel: 000000000000000c ffff8801048a7dc8 ffff8801114685a8 ffffffffa00b80d7
kernel: ffff8801124d5e38 ffff8801032a2558 ffff88010ce24d68 0000000000000000
kernel: ffff88011146b300 ffff8801124d5d44 ffff8801124d5d78 ffffffffa00db7e1
kernel: Call Trace:
kernel: [<ffffffffa00b80d7>] ? journal_start+0x8c/0xbd [jbd]
kernel: [<ffffffffa00db7e1>] ext3_tmpfile+0xb2/0x13b [ext3]
kernel: [<ffffffff821076f8>] path_openat+0x11f/0x5e7
kernel: [<ffffffff821c86b4>] ? list_del+0x11/0x30
kernel: [<ffffffff82065fa2>] ?  __dequeue_entity+0x33/0x38
kernel: [<ffffffff82107cd5>] do_filp_open+0x3f/0x8d
kernel: [<ffffffff82112532>] ? __alloc_fd+0x50/0x102
kernel: [<ffffffff820f9296>] do_sys_open+0x13b/0x1cd
kernel: [<ffffffff820f935c>] SyS_open+0x1e/0x20
kernel: [<ffffffff82398c02>] system_call_fastpath+0x16/0x1b
kernel: Code: 39 c7 0f 85 67 01 00 00 0f b7 03 25 00 f0 00 00 3d 00 40 00 00 74 18 3d 00 80 00 00 74 11 3d 00 a0 00 00 74 0a 83 7b 48 00 74 04 <0f> 0b eb fe 49 8b 85 50 03 00 00 4c 89 f6 48 c7 c7 c0 99 0e a0
kernel: RIP  [<ffffffffa00db5ae>] ext3_orphan_add+0x6a/0x1eb [ext3]
kernel: RSP <ffff8801124d5cc8>

Here we couldn't call clear_nlink() directly because in d_tmpfile() we
will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>

dda5690d

ext4: fix a BUG when opening a file with O_TMPFILE flag · e94bd349

Zheng Liu authored Jul 20, 2013

When we try to open a file with O_TMPFILE flag, we will trigger a bug.
The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
this check always fails because we set ->i_nlink = 1 in
inode_init_always().  We can use the following program to trigger it:

int main(int argc, char *argv[])
{
	int fd;

	fd = open(argv[1], O_TMPFILE, 0666);
	if (fd < 0) {
		perror("open ");
		return -1;
	}
	close(fd);
	return 0;
}

The oops message looks like this:

kernel BUG at fs/ext4/namei.c:2572!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetli
nk can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc ir
da pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcsp
kr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
 0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
 ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
 ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
Call Trace:
 [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
 [<ffffffff811cba78>] path_openat+0x238/0x700
 [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
 [<ffffffff811cc647>] do_filp_open+0x47/0xa0
 [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
 [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
 [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
 [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
 [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
 [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00

Here we couldn't call clear_nlink() directly because in d_tmpfile() we
will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>

e94bd349

20 Jul, 2013 7 commits

Merge tag 'staging-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · f6a0d9d5

Linus Torvalds authored Jul 20, 2013

Pull staging tree fixes from Greg KH:
 "Here are a few iio driver fixes for 3.11-rc2.  They are still spread
  across drivers/iio and drivers/staging/iio so they are coming in
  through this tree.

  I've also removed the drivers/staging/csr/ driver as the developers
  who originally sent it to me have moved on to other companies, and CSR
  still will not send us the specs for the device, making the driver
  pretty much obsolete and impossible to fix up.  Deleting it now
  prevents people from sending in lots of tiny codingsyle fixes that
  will never go anywhere.

  It also helps to offset the large lustre filesystem merge that
  happened in 3.11-rc1 in the overall 3.11.0 diffstat.  :)"

* tag 'staging-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: csr: remove driver
  iio: lps331ap: Fix wrong in_pressure_scale output value
  iio staging: fix lis3l02dq, read error handling
  staging:iio:ad7291: add missing .driver_module to struct iio_info
  iio: ti_am335x_adc: add missing .driver_module to struct iio_info
  iio: mxs-lradc: Remove useless check in read_raw
  iio: mxs-lradc: Fix misuse of iio->trig
  iio: inkern: fix iio_convert_raw_to_processed_unlocked
  iio: Fix iio_channel_has_info
  iio:trigger: device_unregister->device_del to avoid double free
  iio: dac: ad7303: fix error return code in ad7303_probe()

f6a0d9d5

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 36231d25

Linus Torvalds authored Jul 20, 2013

Pull vfs fixes from Al Viro:
 "The sget() one is a long-standing bug and will need to go into -stable
  (in fact, it had been originally caught in RHEL6), the other two are
  3.11-only"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: constify dentry parameter in d_count()
  livelock avoidance in sget()
  allow O_TMPFILE to work with O_WRONLY

36231d25

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 19bf1c2c

Linus Torvalds authored Jul 20, 2013

Pull ext4 bugfixes from Ted Ts'o:
 "Fixes for 3.11-rc2, sent at 5pm, in the professoinal style.  :-)"

I'm not sure I like this new level of "professionalism".
9-5, people, 9-5.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: call ext4_es_lru_add() after handling cache miss
  ext4: yield during large unlinks
  ext4: make the extent_status code more robust against ENOMEM failures
  ext4: simplify calculation of blocks to free on error
  ext4: fix error handling in ext4_ext_truncate()

19bf1c2c

Merge tag 'nfs-for-3.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 3be542d4

Linus Torvalds authored Jul 20, 2013

Pull NFS client bugfixes from Trond Myklebust:
 - Fix a regression against NFSv4 FreeBSD servers when creating a new
   file
 - Fix another regression in rpc_client_register()

* tag 'nfs-for-3.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix a regression against the FreeBSD server
  SUNRPC: Fix another issue with rpc_client_register()

3be542d4

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next · 90290c4e

Linus Torvalds authored Jul 20, 2013

Pull btrfs fixes from Josef Bacik:
 "I'm playing the role of Chris Mason this week while he's on vacation.
  There are a few critical fixes for btrfs here, all regressions and
  have been tested well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next:
  Btrfs: fix wrong write offset when replacing a device
  Btrfs: re-add root to dead root list if we stop dropping it
  Btrfs: fix lock leak when resuming snapshot deletion
  Btrfs: update drop progress before stopping snapshot dropping

90290c4e

vfs: constify dentry parameter in d_count() · 24924a20

Peng Tao authored Jul 18, 2013

so that it can be used in places like d_compare/d_hash
without causing a compiler warning.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

24924a20

livelock avoidance in sget() · acfec9a5

Al Viro authored Jul 20, 2013

Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
to fail.  The superblock is on ->fs_supers, ->s_umount is held exclusive,
->s_active is 1.  Along comes two more processes, trying to mount the same
thing; sget() in each is picking that superblock, bumping ->s_count and
trying to grab ->s_umount.  ->s_active is 3 now.  Original mount(2)
finally gets to deactivate_locked_super() on failure; ->s_active is 2,
superblock is still ->fs_supers because shutdown will *not* happen until
->s_active hits 0.  ->s_umount is dropped and now we have two processes
chasing each other:
s_active = 2, A acquired ->s_umount, B blocked
A sees that the damn thing is stillborn, does deactivate_locked_super()
s_active = 1, A drops ->s_umount, B gets it
A restarts the search and finds the same superblock.  And bumps it ->s_active.
s_active = 2, B holds ->s_umount, A blocked on trying to get it
... and we are in the earlier situation with A and B switched places.

The root cause, of course, is that ->s_active should not grow until we'd
got MS_BORN.  Then failing ->mount() will have deactivate_locked_super()
shut the damn thing down.  Fortunately, it's easy to do - the key point
is that grab_super() is called only for superblocks currently on ->fs_supers,
so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
bump ->s_active; we must never increment ->s_count for superblocks past
->kill_sb(), but grab_super() is never called for those.

The bug is pretty old; we would've caught it by now, if not for accidental
exclusion between sget() for block filesystems; the things like cgroup or
e.g. mtd-based filesystems don't have anything of that sort, so they get
bitten.  The right way to deal with that is obviously to fix sget()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

acfec9a5

19 Jul, 2013 2 commits

allow O_TMPFILE to work with O_WRONLY · ba57ea64
Al Viro authored Jul 20, 2013
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
ba57ea64

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · d471ce53

Linus Torvalds authored Jul 19, 2013

Pull UML fixes from Richard Weinberger:
 "Special thanks goes to Toralf Föster for continuously testing UML and
  reporting issues!"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: remove dead code
  um: siginfo cleanup
  uml: Fix which_tmpdir failure when /dev/shm is a symlink, and in other edge cases
  um: Fix wait_stub_done() error handling
  um: Mark stub pages mapping with VM_PFNMAP
  um: Fix return value of strnlen_user()

d471ce53