Commits · 053f5b6525ae32da397e6c47721961f800d2c924 · Kirill Smelkov / linux

10 Jun, 2014 1 commit

raid5: speedup sync_request processing · 053f5b65

Eivind Sarto authored Jun 09, 2014

The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread.  The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
creates some more parallelism between the resync-thread and raid5 kernel daemon.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

053f5b65

05 Jun, 2014 1 commit

md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32

hui jiao authored Jun 05, 2014

A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs,  the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0
Signed-off-by: NeilBrown <neilb@suse.de>

2844dc32

29 May, 2014 7 commits

raid5: add an option to avoid copy data from bio to stripe cache · d592a996

Shaohua Li authored May 21, 2014

The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.

In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.

So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.

Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.

Neilb:
  changed BUG_ON to WARN_ON
  Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

d592a996

md/bitmap: remove confusing code from filemap_get_page. · f2e06c58

NeilBrown authored May 28, 2014

file_page_index(store, 0) is *always* 0.
This is because the bitmap sb, at 256 bytes, is *always* less than
one page.
So subtracting it has no effect and the code should be removed.
Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>

f2e06c58

raid5: avoid release list until last reference of the stripe · cf170f3f

Eivind Sarto authored May 28, 2014

The (lockless) release_list reduces lock contention, but there is excessive
queueing and dequeuing of stripes on this list. A stripe will currently be
queued on the release_list with a stripe reference count > 1. This can cause
the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
without doing any other useful processing of the stripe. The are two cases
when the stripe can be put on the release_list multiple times before it is
actually handled by the kernel thread(s).
1) make_request() activates the stripe processing in 4k increments. When a
write request is large enough to span multiple chunks of a stripe_head, the
first 4k chunk adds the stripe to the plug list. The next 4k chunk that is
processed for the same stripe puts the stripe on the release_list with a
refcount=2. This can cause the kernel thread to process and decrement the
stripe before the stripe us unplugged, which again will put it back on the
release_list.
2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
refcount is set to the number of active IO (for each chunk). The stripe is
released as each IO complete, and can be queued and dequeued multiple times
on the release_list, until its refcount finally reached zero.

This simple patch will ensure a stripe is only queued on the release_list when
its refcount=1 and is ready to be handled by the kernel thread(s). I added some
instrumentation to raid5 and counted the number of times striped were queued on
the release_list for a variety of write IO sizes. Without this patch the number
of times stripes got queued on the release_list was 100-500% higher than with
the patch. The excess queuing will increase with the IO size. The patch also
improved throughput by 5-10%.
Signed-off-by: Eivind Sarto <esarto@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>

cf170f3f

md: md_clear_badblocks should return an error code on failure. · 8b32bf5e

NeilBrown authored May 28, 2014

Julia Lawall and coccinelle report that md_clear_badblocks always
returns 0, despite appearing to have an error path.
The error path really should return an error code.  ENOSPC is
reasonably appropriate.
Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NeilBrown <neilb@suse.de>

8b32bf5e

md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548

NeilBrown authored May 28, 2014

If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.

However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending.  This significantly hurts
throughput.

So:
 - when handle_stripe_dirtying find a block that it wants on a device
   that is failed, set STRIPE_DELAY, instead of doing nothing, and
 - when fetch_block detects that a read might be required to satisfy a
   write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
   and if we would actually need to read something to complete the write.

This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle.  For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. no an
R5_OVERWRITE).

Also clean up a couple of horrible bits of formatting.
Reported-by: Patrik Horník <patrik@dsl.sk>
Signed-off-by: NeilBrown <neilb@suse.de>

67f45548

md: refuse to change shape of array if it is active but read-only · bd8839e0

NeilBrown authored May 28, 2014

read-only arrays should not be changed.  This includes changing
the level, layout, size, or number of devices.

So reject those changes for readonly arrays.
Signed-off-by: NeilBrown <neilb@suse.de>

bd8839e0

md: always set MD_RECOVERY_INTR when interrupting a reshape thread. · 2ac295a5

NeilBrown authored May 29, 2014

Commit 8313b8e5
   md: fix problem when adding device to read-only array with bitmap.

added a called to md_reap_sync_thread() which cause a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.

This only happens when mddev->ro is set and normally reshape threads
don't run in that situation.  But raid5 and raid10 can start a reshape
thread during "run" is the array is in the middle of a reshape.
They do this even if ->ro is set.

So it is best to set MD_RECOVERY_INTR before abortingg the
sync thread, just in case.

Though it rare for this to trigger a problem it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any stable which the offending commit was applied to.
(3.2 or later)

Fixes: 8313b8e5
Cc: stable@vger.kernel.org (3.2+)
Signed-off-by: NeilBrown <neilb@suse.de>

2ac295a5

28 May, 2014 1 commit

md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync". · 3991b31e

NeilBrown authored May 28, 2014

If mddev->ro is set, md_to_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.

If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.

Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted.  If the array is also set
read-only at this point, the reshape will instantly appear to success,
resulting in data corruption.

Consequently, this patch is suitable for any -stable kernel.

Cc: stable@vger.kernel.org (any)
Signed-off-by: NeilBrown <neilb@suse.de>

3991b31e

25 May, 2014 13 commits

Linux 3.15-rc7 · c7208164
Linus Torvalds authored May 25, 2014

c7208164

Merge branch 'afs' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · db1003f2

Linus Torvalds authored May 25, 2014

Pull AFS fixes and cleanups from David Howells:
 "Here are some patches to the AFS filesystem:

  1) Fix problems in the clean-up parts of the cache manager service
     handler.

  2) Split afs_end_call() introduced in (1) and replace some identical
     code elsewhere with a call to the first half of the split function.

  3) Fix an error introduced in the workqueue PREPARE_WORK() elimination
     commits.

  4) Clean up argument passing to functions called from the workqueue as
     there's now an insulating layer between them and the workqueue.
     This is possible from (3)"

* 'afs' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  AFS: Pass an afs_call* to call->async_workfn() instead of a work_struct*
  AFS: Fix kafs module unloading
  AFS: Part of afs_end_call() is identical to code elsewhere, so split it
  AFS: Fix cache manager service handlers

db1003f2

Merge branch 'rdunlap' (patches from Randy Dunlap) · ef0d2c16

Linus Torvalds authored May 25, 2014

Merge documentation fixes from Randy Dunlap.

* emailed patches from Randy Dunlap <rdunlap@infradead.org>:
  Documentation: update /proc/stat "intr" count summary
  Documentation: update java sample wrapper for java 7
  Documentation: update thunderbird email client settings
  Documentation: fix typos in drm docbook

ef0d2c16

Documentation: update /proc/stat "intr" count summary · 3568a1db

Jan Moskyto Matejka authored May 15, 2014

The sum at the beginning of line "intr" includes also unnumbered
interrupts.  It implies that the sum at the beginning isn't the sum
of the remainder of the line, not even an estimation.

Fixed the documentation to mention that.

This behaviour was added to /proc/stat in commit a2eddfa9 ("x86:
make /proc/stat account for all interrupts")
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3568a1db

Documentation: update java sample wrapper for java 7 · f76f133b

Jonathan Callen authored May 15, 2014

The sample wrapper currently fails on some Java 7 .class files.  This
updates the wrapper to properly handle those files.
Signed-off-by: Jonathan Callen <jcallen@gentoo.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f76f133b

Documentation: update thunderbird email client settings · f9a0974d

Paul McQuade authored May 15, 2014

Added setting to email-clients that is easier to read and is easier to
setup thunderbird.  Removed config settings and added GUI settings.
Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9a0974d

Documentation: fix typos in drm docbook · f9b899c4

Masanari Iida authored May 15, 2014

Fix spelling typo in DocBook/drm.tmpl
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9b899c4

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging · f016a644

Linus Torvalds authored May 25, 2014

Pull hwmon subsystem fixes from Jean Delvare.

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  hwmon: (ntc_thermistor) Fix OF device ID mapping
  hwmon: (ntc_thermistor) Fix dependencies
  hwmon: Document temp[1-*]_min_hyst sysfs attribute

f016a644

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 0e37c275

Linus Torvalds authored May 25, 2014

Pull single scsi fix from James Bottomley:
 "This is a single fix for a bug exposed by a sysfs change in 3.13 which
  now causes libsas to trigger a warn on in device removal"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  [SCSI] scsi_transport_sas: move bsg destructor into sas_rphy_remove

0e37c275

Merge branch 'for-3.15' of git://linux-nfs.org/~bfields/linux · 80a1de29

Linus Torvalds authored May 25, 2014

Pull two nfsd bugfixes from Bruce Fields:
 "Just two bugfixes, one for a merge-window-introduced ACL regression,
  the other for a longer-standing v4 state bug"

* 'for-3.15' of git://linux-nfs.org/~bfields/linux:
  nfsd4: warn on finding lockowner without stateid's
  nfsd4: remove lockowner when removing lock stateid
  nfsd4: fix corruption on setting an ACL.

80a1de29

hwmon: (ntc_thermistor) Fix OF device ID mapping · ead82d67

Jean Delvare authored May 25, 2014

The mapping from OF device IDs to platform device IDs is wrong.
TYPE_NCPXXWB473 is 0, TYPE_NCPXXWL333 is 1, so
ntc_thermistor_id[TYPE_NCPXXWB473] is { "ncp15wb473", TYPE_NCPXXWB473 }
while
ntc_thermistor_id[TYPE_NCPXXWL333] is { "ncp18wb473", TYPE_NCPXXWB473 }.

So the name is wrong for all but the "ntc,ncp15wb473" entry, and the
type is wrong for the "ntc,ncp15wl333" entry.

So map the entries by index, it is neither elegant nor robust but at
least it is correct.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: 9e8269de hwmon: (ntc_thermistor) Add DT with IIO support to NTC thermistor driver
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Cc: Naveen Krishna Chatradhi <ch.naveen@samsung.com>
Cc: Doug Anderson <dianders@chromium.org>

ead82d67

hwmon: (ntc_thermistor) Fix dependencies · 59cf4243

Jean Delvare authored May 25, 2014

In commit 9e8269de, support was added for ntc_thermistor devices being
declared in the device tree and implemented on top of IIO. With that
change, a dependency was added to the ntc_thermistor driver:

	depends on (!OF && !IIO) || (OF && IIO)

This construct has the drawback that the driver can no longer be
selected when OF is set and IIO isn't, nor when IIO is set and OF is
not. This is a regression for the original users of the driver.

As the new code depends on IIO and is useless without OF, include it
only if both are enabled, and set the dependencies accordingly. This
is clearer, more simple and more correct.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: 9e8269de hwmon: (ntc_thermistor) Add DT with IIO support to NTC thermistor driver
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Cc: Naveen Krishna Chatradhi <ch.naveen@samsung.com>
Cc: Doug Anderson <dianders@chromium.org>

59cf4243

hwmon: Document temp[1-*]_min_hyst sysfs attribute · 01325145

Jean Delvare authored May 25, 2014

The temp[1-*]_min_hyst sysfs attribute is already implemented by 3
hwmon drivers (adt7x10, lm77 and lm92) but was missing from the
standard interface.

Also add temp[1-*]_lcrit_hyst for consistency, even though no driver
implement that one for the time being.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>

01325145

23 May, 2014 17 commits

Merge tag 'dmaengine-fixes-3.15-rc5' of... · 03743007

Linus Torvalds authored May 23, 2014

Merge tag 'dmaengine-fixes-3.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine fixes from Dan Williams:
 "Two fixes for -stable:

   - async_mult() sometimes maps less buffers than initially requested.
      We end up freeing dmaengine_unmap_data on an invalid pool.

   - mv_xor: register write ordering fix"

* tag 'dmaengine-fixes-3.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  dmaengine: fix dmaengine_unmap failure
  dma: mv_xor: Flush descriptors before activating a channel

03743007

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 1ee1ceaf

Linus Torvalds authored May 23, 2014

Pull sparc fixes from David Miller:
 "A small bunch of bug fixes, in particular:

   1) On older cpus we need a different chunk of virtual address space
      to map the huge page TSB.

   2) Missing memory barrier in Niagara2 memcpy.

   3) trinity showed some places where fault validation was
      unnecessarily loud on sparc64

   4) Some sysfs printf's need a type adjustment, from Toralf Förster"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: fix format string mismatch in arch/sparc/kernel/sysfs.c
  sparc64: Add membar to Niagara2 memcpy code.
  sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
  sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.

1ee1ceaf

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 5fa6a683

Linus Torvalds authored May 23, 2014

Pull networking fixes from David Miller:
 "It looks like a sizeble collection but this is nearly 3 weeks of bug
  fixing while you were away.

   1) Fix crashes over IPSEC tunnels with NAT, the latter can reroute
      the packet through a non-IPSEC protected path and the code has to
      be able to handle SKBs attached to routes lacking an attached xfrm
      state.  From Steffen Klassert.

   2) Fix OOPSs in ipv4 and ipv6 ipsec layers for unsupported
      sub-protocols, also from Steffen Klassert.

   3) Set local_df on fragmented netfilter skbs otherwise we won't be
      able to forward successfully, from Florian Westphal.

   4) cdc_mbim ipv6 neighbour code does __vlan_find_dev_deep without
      holding RCU lock, from Bjorn Mork.

   5) local_df test in ip_may_fragment is inverted, from Florian
      Westphal.

   6) jme driver doesn't check for DMA mapping failures, from Neil
      Horman.

   7) qlogic driver doesn't calculate number of TX queues properly, from
      Shahed Shaikh.

   8) fib_info_cnt can drift irreversibly positive if we fail to
      allocate the fi->fib_metrics array, from Sergey Popovich.

   9) Fix use after free in ip6_route_me_harder(), also from Sergey
      Popovich.

  10) When SYSCTL is disabled, we don't handle local_port_range and
      ping_group_range defaults properly at all, from Cong Wang.

  11) Unaccelerated VLAN tagged frames improperly handled by cdc_mbim
      driver, fix from Bjorn Mork.

  12) cassini driver needs nested lock annotations for TX locking, from
      Emil Goode.

  13) On init error ipv6 VTI driver can unregister pernet ops twice,
      oops.  Fix from Mahtias Krause.

  14) If macvlan device is down, don't propagate IFF_ALLMULTI changes,
      from Peter Christensen.

  15) Missing NULL pointer check while parsing netlink config options in
      ip6_tnl_validate().  From Susant Sahani.

  16) Fix handling of neighbour entries during ipv6 router reachability
      probing, from Duan Jiong.

  17) x86 and s390 JIT address randomization has some address
      calculation bugs leading to crashes, from Alexei Starovoitov and
      Heiko Carstens.

  18) Clear up those uglies with nop patching and net_get_random_once(),
      from Hannes Frederic Sowa.

  19) Option length miscalculated in ip6_append_data(), fix also from
      Hannes Frederic Sowa.

  20) A while ago we fixed a race during device unregistry when a
      namespace went down, turns out there is a second place that needs
      similar protection.  From Cong Wang.

  21) In the new Altera TSE driver multicast filtering isn't working,
      disable it and just use promisc mode until the cause is found.
      From Vince Bridgers.

  22) When we disable router enabling in ipv6 we have to flush the
      cached routes explicitly, from Duan Jiong.

  23) NBMA tunnels should not cache routes on the tunnel object because
      the key is variable, from Timo Teräs.

  24) With stacked devices GRO information in skb->cb[] can be not setup
      properly, make sure it is in all code paths.  From Eric Dumazet.

  25) Really fix stacked vlan locking, multiple levels of nesting with
      intervening non-vlan devices are possible.  From Vlad Yasevich.

  26) Fallback ipip tunnel device's mtu is not setup properly, from
      Steffen Klassert.

  27) The packet scheduler's tcindex filter can crash because we
      structure copy objects with list_head's inside, oops.  From Cong
      Wang.

  28) Fix CHECKSUM_COMPLETE handling for ipv6 GRE tunnels, from Eric
      Dumazet.

  29) In some configurations 'itag' in __mkroute_input() can end up
      being used uninitialized because of how fib_validate_source()
      works.  Fix it by explitly initializing itag to zero like all the
      other fib_validate_source() callers do, from Li RongQing"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  batman: fix a bogus warning from batadv_is_on_batman_iface()
  ipv4: initialise the itag variable in __mkroute_input
  bonding: Send ALB learning packets using the right source
  bonding: Don't assume 802.1Q when sending alb learning packets.
  net: doc: Update references to skb->rxhash
  stmmac: Remove unbalanced clk_disable call
  ipv6: gro: fix CHECKSUM_COMPLETE support
  net_sched: fix an oops in tcindex filter
  can: peak_pci: prevent use after free at netdev removal
  ip_tunnel: Initialize the fallback device properly
  vlan: Fix build error wth vlan_get_encap_level()
  can: c_can: remove obsolete STRICT_FRAME_ORDERING Kconfig option
  MAINTAINERS: Pravin Shelar is Open vSwitch maintainer.
  bnx2x: Convert return 0 to return rc
  bonding: Fix alb mode to only use first level vlans.
  bonding: Fix stacked device detection in arp monitoring
  macvlan: Fix lockdep warnings with stacked macvlan devices
  vlan: Fix lockdep warning with stacked vlan devices.
  net: Allow for more then a single subclass for netif_addr_lock
  net: Find the nesting level of a given device by type.
  ...

5fa6a683

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f02f79db

Linus Torvalds authored May 23, 2014

Pull scheduler fixes from Ingo Molnar:
 "The biggest commit is an irqtime accounting loop latency fix, the rest
  are misc fixes all over the place: deadline scheduling, docs, numa,
  balancer and a bad to-idle latency fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Initialize newidle balance stats in sd_numa_init()
  sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
  sched: Skip double execution of pick_next_task_fair()
  sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
  sched/deadline: Fix memory leak
  sched/deadline: Fix sched_yield() behavior
  sched: Sanitize irq accounting madness
  sched/docbook: Fix 'make htmldocs' warnings caused by missing description

f02f79db

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e6a32c3a

Linus Torvalds authored May 23, 2014

Pull perf fixes from Ingo Molnar:
 "The biggest changes are fixes for races that kept triggering Trinity
  crashes, plus liblockdep build fixes and smaller misc fixes.

  The liblockdep bits in perf/urgent are a pull mistake - they should
  have been in locking/urgent - but by the time I noticed other commits
  were added and testing was done :-/ Sorry about that"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
  perf: Prevent false warning in perf_swevent_add
  perf: Limit perf_event_attr::sample_period to 63 bits
  tools/liblockdep: Remove all build files when doing make clean
  tools/liblockdep: Build liblockdep from tools/Makefile
  perf/x86/intel: Fix Silvermont's event constraints
  perf: Fix perf_event_init_context()
  perf: Fix race in removing an event

e6a32c3a

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 2b2d323a

Linus Torvalds authored May 23, 2014

Pull drm radeon and nouveau fixes from Dave Airlie:
 "Fixes for the other big two.

  The radeon VCE one is large but it fixes some userspace triggerable
  issues, otherwise its blackscreens and oopses.

  Nouveau fixes a bleeding laptop panel issue when displayport is used
  sometimes"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon/pm: don't allow debugfs/sysfs access when PX card is off (v2)
  drm/radeon: avoid segfault on device open when accel is not working.
  drm/radeon: fix typo in finding PLL params
  drm/radeon: fix register typo on si
  drm/radeon: fix buffer placement under memory pressure v2
  drm/radeon: fix page directory update size estimation
  drm/radeon: handle non-VGA class pci devices with ATRM
  drm/radeon: fix DCE83 check for mullins
  drm/radeon: check VCE relocation buffer range v3
  drm/radeon: also try GART for CPU accessed buffers
  drm/gf119-/disp: fix nasty bug which can clobber SOR0's clock setup
  drm/nvd9/therm: handle another kind of PWM fan

2b2d323a

Merge branch 'akpm' (incoming from Andrew) · fc3ac5c7

Linus Torvalds authored May 23, 2014

Merge misc fixes from Andrew Morton:
 "9 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add closing angle bracket to Vince Bridgers' email address
  Documentation: fix DOCBOOKS=... building
  ocfs2: fix double kmem_cache_destroy in dlm_init
  mm/memory-failure.c: fix memory leak by race between poison and unpoison
  wait: swap EXIT_ZOMBIE(Z) and EXIT_DEAD(X) chars in TASK_STATE_TO_CHAR_STR
  memcg: fix swapcache charge from kernel thread context
  mm: madvise: fix MADV_WILLNEED on shmem swapouts
  mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
  hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage

fc3ac5c7

MAINTAINERS: add closing angle bracket to Vince Bridgers' email address · 0d9327ab

Tobias Klauser authored May 22, 2014

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Vince Bridgers <vbridgers2013@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0d9327ab

Documentation: fix DOCBOOKS=... building · e60cbeed

Johannes Berg authored May 22, 2014

Prior to commit 42661299 ("[media] DocBook: Move all media docbook
stuff into its own directory") it was possible to build only a single
(or more) book(s) by calling, for example

    make htmldocs DOCBOOKS=80211.xml

This now fails:

    cp: target `.../Documentation/DocBook//media_api' is not a directory

Ignore errors from that copy to make this possible again.

Fixes: 42661299 ("[media] DocBook: Move all media docbook stuff into its own directory")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e60cbeed

ocfs2: fix double kmem_cache_destroy in dlm_init · 66db6cfd

Joseph Qi authored May 22, 2014

In dlm_init, if create dlm_lockname_cache failed in
dlm_init_master_caches, it will destroy dlm_lockres_cache which created
before twice.  And this will cause system die when loading modules.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

66db6cfd

mm/memory-failure.c: fix memory leak by race between poison and unpoison · 3e030ecc

Naoya Horiguchi authored May 22, 2014

When a memory error happens on an in-use page or (free and in-use)
hugepage, the victim page is isolated with its refcount set to one.

When you try to unpoison it later, unpoison_memory() calls put_page()
for it twice in order to bring the page back to free page pool (buddy or
free hugepage list).  However, if another memory error occurs on the
page which we are unpoisoning, memory_failure() returns without
releasing the refcount which was incremented in the same call at first,
which results in memory leak and unconsistent num_poisoned_pages
statistics.  This patch fixes it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>    [2.6.32+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3e030ecc

wait: swap EXIT_ZOMBIE(Z) and EXIT_DEAD(X) chars in TASK_STATE_TO_CHAR_STR · ad0f614e

Masatake YAMATO authored May 22, 2014

In commit ad86622b ("wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide
EXIT_TRACE from user-space") the order of task state definitions were
changed: EXIT_DEAD and EXIT_ZOMBIE were swapped.  Though the charterers
for the states in TASK_STATE_TO_CHAR_STR string were not updated.  This
patch synchronizes the string to the order of definitions.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ad0f614e

memcg: fix swapcache charge from kernel thread context · 6f6acb00

Michal Hocko authored May 22, 2014

Commit 284f39af ("mm: memcg: push !mm handling out to page cache
charge function") explicitly checks for page cache charges without any
mm context (from kernel thread context[1]).

This seemed to be the only possible case where memory could be charged
without mm context so commit 03583f1a ("memcg: remove unnecessary
!mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
get_mem_cgroup_from_mm().  This however caused another NULL ptr
dereference during early boot when loopback kernel thread splices to
tmpfs as reported by Stephan Kulow:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
  IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
  Oops: 0000 [#1] SMP
  Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
  CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    __mem_cgroup_try_charge_swapin+0x40/0xe0
    mem_cgroup_charge_file+0x8b/0xd0
    shmem_getpage_gfp+0x66b/0x7b0
    shmem_file_splice_read+0x18f/0x430
    splice_direct_to_actor+0xa2/0x1c0
    do_lo_receive+0x5a/0x60 [loop]
    loop_thread+0x298/0x720 [loop]
    kthread+0xc6/0xe0
    ret_from_fork+0x7c/0xb0

Also Branimir Maksimovic reported the following oops which is tiggered
for the swapcache charge path from the accounting code for kernel threads:

  CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P           OE 3.15.0-rc5-core2-custom #159
  Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
  task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
  RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
  Call Trace:
    __mem_cgroup_try_charge_swapin+0x45/0xf0
    mem_cgroup_charge_file+0x9c/0xe0
    shmem_getpage_gfp+0x62c/0x770
    shmem_write_begin+0x38/0x40
    generic_perform_write+0xc5/0x1c0
    __generic_file_aio_write+0x1d1/0x3f0
    generic_file_aio_write+0x4f/0xc0
    do_sync_write+0x5a/0x90
    do_acct_process+0x4b1/0x550
    acct_process+0x6d/0xa0
    do_exit+0x827/0xa70
    kthread+0xc3/0xf0

This patch fixes the issue by reintroducing mm check into
get_mem_cgroup_from_mm.  We could do the same trick in
__mem_cgroup_try_charge_swapin as we do for the regular page cache path
but it is not worth troubles.  The check is not that expensive and it is
better to have get_mem_cgroup_from_mm more robust.

[1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2

Fixes: 03583f1a ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
Reported-and-tested-by: Stephan Kulow <coolo@suse.com>
Reported-by: Branimir Maksimovic <branimir.maksimovic@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6f6acb00

mm: madvise: fix MADV_WILLNEED on shmem swapouts · 55231e5c

Johannes Weiner authored May 22, 2014

MADV_WILLNEED currently does not read swapped out shmem pages back in.

Commit 0cd6144a ("mm + fs: prepare for non-page entries in page
cache radix trees") made find_get_page() filter exceptional radix tree
entries but failed to convert all find_get_page() callers that WANT
exceptional entries over to find_get_entry(). One of them is shmem swap
readahead in madvise, which now skips over any swap-out records.

Convert it to find_get_entry().

Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

55231e5c

mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT · 7fcbbaf1

Jens Axboe authored May 22, 2014

In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors().  That
smells fishy.  Looking further, this is basically what happens:

blkdev_aio_read()
    generic_file_aio_read()
        filemap_write_and_wait_range()
            if (!mapping->nr_pages)
                filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation.  The
patch below tests each of these bits before clearing them, avoiding this
issue.  In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7fcbbaf1

hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage · b985194c

Chen Yucong authored May 22, 2014

For handling a free hugepage in memory failure, the race will happen if
another thread hwpoisoned this hugepage concurrently.  So we need to
check PageHWPoison instead of !PageHWPoison.

If hwpoison_filter(p) returns true or a race happens, then we need to
unlock_page(hpage).
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>	[2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b985194c

parisc: 'renameat2()' doesn't need (or have) a separate compat system call · 9abd09ac

Linus Torvalds authored May 23, 2014

The 'renameat2()' system call was incorrectly added as a ENTRY_COMP() in
the parisc system call table by commit 18e480aa ("parisc: add
renameat2 syscall").  That causes a link-time error due to there not
being any compat version of that system call:

  arch/parisc/kernel/built-in.o: In function `sys_call_table':
  (.rodata+0xad0): undefined reference to `compat_sys_renameat2'
  make: *** [vmlinux] Error 1

Easily fixed by marking the system call as being the same for compat as
for native by using ENTRY_SAME() instead of ENTRY_COMP().
Reported-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9abd09ac