Commits · d79698da32b317e96216236f265a9b72b78ae568 · nexedi / linux

26 Jul, 2011 23 commits

ceph: document unlocked d_parent accesses · d79698da

Sage Weil authored Jul 26, 2011

For the most part we don't care about racing with rename when directing
MDS requests; either the old or new parent is fine.  Document that, and
do some minor cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

d79698da

ceph: explicitly reference rename old_dentry parent dir in request · 41b02e1f

Sage Weil authored Jul 26, 2011

We carry a pin on the parent directory for the rename source and dest
dentries.  For the source it's r_locked_dir; we need to explicitly
reference the old_dentry parent as well, since the dentry's d_parent may
change between when the request was created and pinned and when it is
freed.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

41b02e1f

ceph: document locking for ceph_set_dentry_offset · 4f177264

Sage Weil authored Jul 26, 2011

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

4f177264

ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug · e5f86dc3

Sage Weil authored Jul 26, 2011

Have caller pass in a safely-obtained reference to the parent directory
for calculating a dentry's hash valud.

While we're here, simpify the flow through ceph_encode_fh() so that there
is a single exit point and cleanup.

Also fix a bug with the dentry hash calculation: calculate the hash for the
dentry we were given, not its parent.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

e5f86dc3

ceph: protect d_parent access in ceph_d_revalidate · bf1c6aca

Sage Weil authored Jul 26, 2011

Protect d_parent with d_lock.  Carry a reference.  Simplify the flow so
that there is a single exit point and cleanup.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

bf1c6aca

ceph: protect access to d_parent · 5f21c96d

Sage Weil authored Jul 26, 2011

d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode.  Also take a reference and drop it in the caller to avoid
a use-after-free.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

5f21c96d

ceph: handle racing calls to ceph_init_dentry · 48d0cbd1

Sage Weil authored Jul 26, 2011

The ->lookup() and prepopulate_readdir() callers are working with unhashed
dentries, so we don't have to worry.  The export.c callers, though, need
to initialize something they got back from d_obtain_alias() and are
potentially racing with other callers.  Make sure we don't return unless
the dentry is properly initialized (by us or someone else).
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

48d0cbd1

ceph: set dir complete frag after adding capability · dfabbed6

Sage Weil authored Jul 26, 2011

Curretly ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issue cap on a new
directory.  That means we clear the just-set bit.  Move the check that sets
the flag to after the cap is added/updated.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

dfabbed6

rbd: set blk_queue request sizes to object size · 029bcbd8

Josh Durgin authored Jul 22, 2011

This improves performance since more requests can be merged.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>

029bcbd8

ceph: set up readahead size when rsize is not passed · e9852227

Yehuda Sadeh authored Jul 22, 2011

This should improve the default read performance, as without it
readahead is practically disabled.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>

e9852227

rbd: cancel watch request when releasing the device · 79e3057c

Yehuda Sadeh authored Jul 12, 2011

We were missing this cleanup, so when a device was released
the osd didn't clean up its watchers list, so following notifications
could be slow as osd needed to timeout on the client.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>

79e3057c

ceph: ignore lease mask · 2f90b852

Sage Weil authored Jul 26, 2011

The lease mask is no longer used (and it changed a while back).  Instead,
use a non-zero duration to indicate that there is a lease being issued.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

2f90b852

ceph: fix ceph_lookup_open intent usage · 468640e3

Sage Weil authored Jul 26, 2011

We weren't properly calling lookup_instantiate_filp when setting up the
lookup intent, which could lead to file leakage on errors.  So:

 - use separate helper for the hidden snapdir translation, immediately
   following the mds request
 - use ceph_finish_lookup for the final dentry/return value dance in the
   exit path
 - lookup_instantiate_filp on success
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

468640e3

ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC · 9bae113a

Sage Weil authored Jul 26, 2011

We only need to put these on the directory unsafe list if they have
side effects that fsync(2) should flush out.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

9bae113a

ceph: fix bad parent_inode calc in ceph_lookup_open · acda7657

Sage Weil authored Jul 26, 2011

We were always getting NULL here because the intent file f_dentry is always
NULL at this point, which means we were always passing NULL to
ceph_mdsc_do_request.  In reality, this was fine, since this isn't
currently ever a write operation that needs to get strung on the dir's
unsafe list.

Use the dir explicitly, and only pass it if this open has side-effects that
a dir fsync should flush.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

acda7657

ceph: avoid carrying Fw cap during write into page cache · d8de9ab6

Sage Weil authored Jul 26, 2011

The generic_file_aio_write call may block on balance_dirty_pages while we
flush data to the OSDs.  If we hold a reference to the FILE_WR cap during
that interval revocation by the MDS (e.g., to do a stat(2)) may be very
slow.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

d8de9ab6

libceph: don't time out osd requests that haven't been received · 4cf9d544

Sage Weil authored Jul 26, 2011

Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing). Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

4cf9d544

ceph: report f_bfree based on kb_avail rather than diffing. · 8f04d422

Greg Farnum authored Jul 26, 2011

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>

8f04d422

ceph: only queue capsnap if caps are dirty · e77dc3e9

Sage Weil authored Jul 26, 2011

We used to go into this branch if i_wrbuffer_ref_head was non-zero.  This
was an ancient check from before we were careful about dealing with all
kinds of caps (and not just dirty pages).  It is cleaner to only queue a
capsnap if there is an actual dirty cap.  If we are racing with...
something...we will end up here with ci->i_wrbuffer_refs but no dirty
caps.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

e77dc3e9

ceph: fix snap writeback when racing with writes · af0ed569

Sage Weil authored Jul 26, 2011

There are two problems that come up when we try to queue a capsnap while a
write is in progress:

 - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
   with dirty == 0.  That will crash later in __ceph_flush_snaps().  Or
   on the FILE_WR cap if a write is in progress.
 - We may not have i_head_snapc set, which causes problems pretty quickly.
   Look to the snaprealm in this case.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

af0ed569

ceph: use flag bit for at_end readdir flag · 9cfa1098

Sage Weil authored Jul 26, 2011

This saves us a word of memory per file.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

9cfa1098

ceph: add F_SYNC file flag to force sync (non-O_DIRECT) io · 4918b6d1

Sage Weil authored Jul 26, 2011

This allows us to force IO through the sync path which you normally only
get when multiple clients are reading/writing to the same file or by
mounting with -o sync.  Among other things, this lets test programs verify
correctness with a single mount.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

4918b6d1

ceph: add flags field to file_info · 252c6728

Sage Weil authored Jul 26, 2011

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>

252c6728

22 Jul, 2011 2 commits
- Linux 3.0 · 02f8c6ae
  Linus Torvalds authored Jul 21, 2011
  
  02f8c6ae
- Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb · 1f922d07
  Linus Torvalds authored Jul 21, 2011
```
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
  sparc,kgdbts: fix compile regression with kgdb test suite
```
  1f922d07
21 Jul, 2011 9 commits

sparc,kgdbts: fix compile regression with kgdb test suite · 33d8881a

Jason Wessel authored Jul 19, 2011

Commit 63ab25eb (kgdbts: unify/generalize gdb breakpoint adjustment)
introduced a compile regression on sparc.

kgdbts.c: In function 'check_and_rewind_pc':
kgdbts.c:307: error: implicit declaration of function 'instruction_pointer_set'

Simply add the correct macro definition for instruction pointer on the
Sparc architecture.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David S. Miller <davem@davemloft.net>

33d8881a

Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 · 2bafc7a2
Linus Torvalds authored Jul 21, 2011
```
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  CIFS: Fix wrong length in cifs_iovec_read
```
2bafc7a2

Merge branch 'x86-urgent-for-linus' of... · 57a6fa9a

Linus Torvalds authored Jul 21, 2011

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Make Dell Latitude E6420 use reboot=pci
  x86: Make Dell Latitude E5420 use reboot=pci

57a6fa9a

x86: Make Dell Latitude E6420 use reboot=pci · a536877e

H. Peter Anvin authored Jul 21, 2011

Yet another variant of the Dell Latitude series which requires
reboot=pci.

From the E5420 bug report by Daniel J Blueman:

> The E6420 is affected also (same platform, different casing and
> features), which provides an external confirmation of the issue; I can
> submit a patch for that later or include it if you prefer:
> http://linux.koolsolutions.com/2009/08/04/howto-fix-linux-hangfreeze-during-reboots-and-restarts/Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@kernel.org>

a536877e

x86: Make Dell Latitude E5420 use reboot=pci · b7798d28

Daniel J Blueman authored May 13, 2011

Rebooting on the Dell E5420 often hangs with the keyboard or ACPI
methods, but is reliable via the PCI method.

[ hpa: this was deferred because we believed for a long time that the
  recent reshuffling of the boot priorities in commit
  660e34ce fixed this platform.
  Unfortunately that turned out to be incorrect. ]
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Link: http://lkml.kernel.org/r/1305248699-2347-1-git-send-email-daniel.blueman@gmail.comSigned-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@kernel.org>

b7798d28

Merge branch 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6 · ad21b115

Linus Torvalds authored Jul 21, 2011

* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6:
  drm/i915: Fix unfenced alignment on pre-G33 hardware
  drm/i915: Add quirk to disable SSC on Lenovo U160 LVDS

ad21b115

vfs: drop conditional inode prefetch in __do_lookup_rcu · b91da88f

Linus Torvalds authored Jul 21, 2011

It seems to hurt performance in real life.  Yes, the inode will be used
later, but the conditional doesn't seem to predict all that well
(negative dentries are not uncommon) and it looks like the cost of
prefetching is simply higher than depending on the cache doing the right
thing.

As usual.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b91da88f

FS-Cache: Fix __fscache_uncache_all_inode_pages()'s outer loop · b307d465

Jan Beulich authored Jul 21, 2011

The compiler, at least for ix86 and m68k, validly warns that the
comparison:

	next <= (loff_t)-1

is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t).  The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.

On m68k the following warning is observed:

  fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
  fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b307d465

CIFS: Fix wrong length in cifs_iovec_read · 2cebaa58

Pavel Shilovsky authored Jul 20, 2011

Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>

2cebaa58

20 Jul, 2011 6 commits

Merge branch 'core-urgent-for-linus' of... · cf6ace16

Linus Torvalds authored Jul 20, 2011

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  signal: align __lock_task_sighand() irq disabling and RCU
  softirq,rcu: Inform RCU of irq_exit() activity
  sched: Add irq_{enter,exit}() to scheduler_ipi()
  rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
  rcu: Streamline code produced by __rcu_read_unlock()
  rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special
  rcu: decrease rcu_report_exp_rnp coupling with scheduler

cf6ace16

Merge branch 'sched-urgent-for-linus' of... · acc11eab

Linus Torvalds authored Jul 20, 2011

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Avoid creating superfluous NUMA domains on non-NUMA systems
  sched: Allow for overlapping sched_domain spans
  sched: Break out cpu_power from the sched_group structure

acc11eab

Merge branch 'x86-urgent-for-linus' of... · 919d25a7

Linus Torvalds authored Jul 20, 2011

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86. reboot: Make Dell Latitude E6320 use reboot=pci
  x86, doc only: Correct real-mode kernel header offset for init_size
  x86: Disable AMD_NUMA for 32bit for now

919d25a7

Merge branch 'rcu/urgent' of... · d1e9ae47

Ingo Molnar authored Jul 20, 2011

Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent

d1e9ae47

signal: align __lock_task_sighand() irq disabling and RCU · a841796f

Paul E. McKenney authored Jul 19, 2011

The __lock_task_sighand() function calls rcu_read_lock() with interrupts
and preemption enabled, but later calls rcu_read_unlock() with interrupts
disabled. It is therefore possible that this RCU read-side critical
section will be preempted and later RCU priority boosted, which means that
rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but
with interrupts disabled. This results in lockdep splats, so this commit
nests the RCU read-side critical section within the interrupt-disabled
region of code. This prevents the RCU read-side critical section from
being preempted, and thus prevents the attempt to deboost with interrupts
disabled.

It is quite possible that a better long-term fix is to make rt_mutex_unlock()
disable irqs when acquiring the rt_mutex structure's ->wait_lock.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

a841796f

softirq,rcu: Inform RCU of irq_exit() activity · ec433f0c

Peter Zijlstra authored Jul 19, 2011

The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level.  This fails because exit_irq()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level.  This situation
can result in failures as follows:

 $task			IRQ		SoftIRQ

 rcu_read_lock()

 /* do stuff */

 <preempt> |= UNLOCK_BLOCKED

 rcu_read_unlock()
   --t->rcu_read_lock_nesting

			irq_enter();
			/* do stuff, don't use RCU */
			irq_exit();
			  sub_preempt_count(IRQ_EXIT_OFFSET);
			  invoke_softirq()

					ttwu();
					  spin_lock_irq(&pi->lock)
					  rcu_read_lock();
					  /* do stuff */
					  rcu_read_unlock();
					    rcu_read_unlock_special()
					      rcu_report_exp_rnp()
					        ttwu()
					          spin_lock_irq(&pi->lock) /* deadlock */

   rcu_read_unlock_special(t);

Ed can simply trigger this 'easy' because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
first, but even without that the above happens.

Cure this by also excluding softirqs from the
rcu_read_unlock_special() handler and ensuring the force_irqthreads
ksoftirqd/# wakeup is done from full softirq context.

[ Alternatively, delaying the ->rcu_read_lock_nesting decrement
  until after the special handling would make the thing more robust
  in the face of interrupts as well.  And there is a separate patch
  for that. ]

Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: Ed Tomlinson <edt@aei.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

ec433f0c