1. 03 Apr, 2014 40 commits
    • Johannes Weiner's avatar
      mm: vmstat: fix UP zone state accounting · 6a3ed212
      Johannes Weiner authored
      Summary:
      
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used list
      "inactive list" and the frequently used list "active list".
      
      Currently, the VM aims for a 1:1 ratio between the lists, which is the
      "perfect" trade-off between the ability to *protect* frequently used
      pages and the ability to *detect* frequently used pages.  This means
      that working set changes bigger than half of cache memory go undetected
      and thrash indefinitely, whereas working sets bigger than half of cache
      memory are unprotected against used-once streams that don't even need
      caching.
      
      This happens on file servers and media streaming servers, where the
      popular files and file sections change over time.  Even though the
      individual files might be smaller than half of memory, concurrent access
      to many of them may still result in their inter-reference distance being
      greater than half of memory.  It's also been reported as a problem on
      database workloads that switch back and forth between tables that are
      bigger than half of memory.  In these cases the VM never recognizes the
      new working set and will for the remainder of the workload thrash disk
      data which could easily live in memory.
      
      Historically, every reclaim scan of the inactive list also took a
      smaller number of pages from the tail of the active list and moved them
      to the head of the inactive list.  This model gave established working
      sets more gracetime in the face of temporary use-once streams, but
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This series solves the problem by maintaining a history of pages evicted
      from the inactive list, enabling the VM to detect frequently used pages
      regardless of inactive list size and facilitate working set transitions.
      
      Tests:
      
      The reported database workload is easily demonstrated on a 8G machine
      with two filesets a 6G.  This fio workload operates on one set first,
      then switches to the other.  The VM should obviously always cache the
      set that the workload is currently using.
      
      This test is based on a problem encountered by Citus Data customers:
        http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
      
      unpatched:
        db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
        db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
        sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%
      
        real    27m15.541s
        user    0m19.059s
        sys     0m51.459s
      
      patched:
        db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
        db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
        sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%
      
        real    6m8.630s
        user    0m14.714s
        sys     0m31.233s
      
      As can be seen, the unpatched kernel simply never adapts to the
      workingset change and db2 is stuck indefinitely with secondary storage
      speed.  The patched kernel needs 2-3 iterations over db2 before it
      replaces db1 and reaches full memory speed.  Given the unbounded
      negative affect of the existing VM behavior, these patches should be
      considered correctness fixes rather than performance optimizations.
      
      Another test resembles a fileserver or streaming server workload, where
      data in excess of memory size is accessed at different frequencies.
      There is very hot data accessed at a high frequency.  Machines should be
      fitted so that the hot set of such a workload can be fully cached or all
      bets are off.  Then there is a very big (compared to available memory)
      set of data that is used-once or at a very low frequency; this is what
      drives the inactive list and does not really benefit from caching.
      Lastly, there is a big set of warm data in between that is accessed at
      medium frequencies and benefits from caching the pages between the first
      and last streamer of each burst.
      
      unpatched:
         hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
        warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
        cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
         sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%
      
      patched:
         hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
        warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
        cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
         sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%
      
      In both kernels, the hot set is propagated to the active list and then
      served from cache.
      
      In both kernels, the beginning of the warm set is propagated to the
      active list as well, but in the unpatched case the active list
      eventually takes up half of memory and no new pages from the warm set
      get activated, despite repeated access, and despite most of the active
      list soon being stale.  The patched kernel on the other hand detects the
      thrashing and manages to keep this cache window rolling through the data
      set.  This frees up enough IO bandwidth that the cold set is served at
      full speed as well and disk utilization even drops by 20%.
      
      For reference, this same test was performed with the traditional
      demotion mechanism, where deactivation is coupled to inactive list
      reclaim.  However, this had the same outcome as the unpatched kernel:
      while the warm set does indeed get activated continuously, it is forced
      out of the active list by inactive list pressure, which is dictated
      primarily by the unrelated cold set.  The warm set is evicted before
      subsequent streamers can benefit from it, even though there would be
      enough space available to cache the pages of interest.
      
      Costs:
      
      Page reclaim used to shrink the radix trees but now the tree nodes are
      reused for shadow entries, where the cost depends heavily on the page
      cache access patterns.  However, with workloads that maintain spatial or
      temporal locality, the shadow entries are either refaulted quickly or
      reclaimed along with the inode object itself.  Workloads that will
      experience a memory cost increase are those that don't really benefit
      from caching in the first place.
      
      A more predictable alternative would be a fixed-cost separate pool of
      shadow entries, but this would incur relatively higher memory cost for
      well-behaved workloads at the benefit of cornercases.  It would also
      make the shadow entry lookup more costly compared to storing them
      directly in the cache structure.
      
      Future:
      
      To simplify the merging process, this patch set is implementing thrash
      detection on a global per-zone level only for now, but the design is
      such that it can be extended to memory cgroups as well.  All we need to
      do is store the unique cgroup ID along the node and zone identifier
      inside the eviction cookie to identify the lruvec.
      
      Right now we have a fixed ratio (50:50) between inactive and active list
      but we already have complaints about working sets exceeding half of
      memory being pushed out of the cache by simple streaming in the
      background.  Ultimately, we want to adjust this ratio and allow for a
      much smaller inactive list.  These patches are an essential step in this
      direction because they decouple the VMs ability to detect working set
      changes from the inactive list size.  This would allow us to base the
      inactive list size on the combined readahead window size for example and
      potentially protect a much bigger working set.
      
      It's also a big step towards activating pages with a reuse distance
      larger than memory, as long as they are the most frequently used pages
      in the workload.  This will require knowing more about the access
      frequency of active pages than what we measure right now, so it's also
      deferred in this series.
      
      Another possibility of having thrashing information would be to revisit
      the idea of local reclaim in the form of zero-config memory control
      groups.  Instead of having allocating tasks go straight to global
      reclaim, they could try to reclaim the pages in the memcg they are part
      of first as long as the group is not thrashing.  This would allow a user
      to drop e.g.  a back-up job in an otherwise unconfigured memcg and it
      would only inflate (and possibly do global reclaim) until it has enough
      memory to do proper readahead.  But once it reaches that point and stops
      thrashing it would just recycle its own used-once pages without kicking
      out the cache of any other tasks in the system more than necessary.
      
      This patch (of 10):
      
      Fengguang Wu's build testing spotted problems with inc_zone_state() and
      dec_zone_state() on UP configurations in out-of-tree patches.
      
      inc_zone_state() is declared but not defined, dec_zone_state() is
      missing entirely.
      
      Just like with *_zone_page_state(), they can be defined like their
      preemption-unsafe counterparts on UP.
      
      [akpm@linux-foundation.org: make it build]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a3ed212
    • Vladimir Davydov's avatar
      mm: vmscan: shrink_slab: rename max_pass -> freeable · d5bc5fd3
      Vladimir Davydov authored
      The name `max_pass' is misleading, because this variable actually keeps
      the estimate number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5bc5fd3
    • Davidlohr Bueso's avatar
      mm, hugetlb: improve page-fault scalability · 8382d914
      Davidlohr Bueso authored
      The kernel can currently only handle a single hugetlb page fault at a
      time.  This is due to a single mutex that serializes the entire path.
      This lock protects from spurious OOM errors under conditions of low
      availability of free hugepages.  This problem is specific to hugepages,
      because it is normal to want to use every single hugepage in the system
      - with normal pages we simply assume there will always be a few spare
      pages which can be used temporarily until the race is resolved.
      
      Address this problem by using a table of mutexes, allowing a better
      chance of parallelization, where each hugepage is individually
      serialized.  The hash key is selected depending on the mapping type.
      For shared ones it consists of the address space and file offset being
      faulted; while for private ones the mm and virtual address are used.
      The size of the table is selected based on a compromise of collisions
      and memory footprint of a series of database workloads.
      
      Large database workloads that make heavy use of hugepages can be
      particularly exposed to this issue, causing start-up times to be
      painfully slow.  This patch reduces the startup time of a 10 Gb Oracle
      DB (with ~5000 faults) from 37.5 secs to 25.7 secs.  Larger workloads
      will naturally benefit even more.
      
      NOTE:
      The only downside to this patch, detected by Joonsoo Kim, is that a
      small race is possible in private mappings: A child process (with its
      own mm, after cow) can instantiate a page that is already being handled
      by the parent in a cow fault.  When low on pages, can trigger spurious
      OOMs.  I have not been able to think of a efficient way of handling
      this...  but do we really care about such a tiny window? We already
      maintain another theoretical race with normal pages.  If not, one
      possible way to is to maintain the single hash for private mappings --
      any workloads that *really* suffer from this scaling problem should
      already use shared mappings.
      
      [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8382d914
    • Joonsoo Kim's avatar
      mm, hugetlb: use vma_resv_map() map types · 4e35f483
      Joonsoo Kim authored
      Util now, we get a resv_map by two ways according to each mapping type.
      This makes code dirty and unreadable.  Unify it.
      
      [davidlohr@hp.com: code cleanups]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e35f483
    • Joonsoo Kim's avatar
      mm, hugetlb: remove resv_map_put · f031dd27
      Joonsoo Kim authored
      This is a preparation patch to unify the use of vma_resv_map()
      regardless of the map type.  This patch prepares it by removing
      resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
      for all resv_maps.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f031dd27
    • Davidlohr Bueso's avatar
      mm, hugetlb: fix race in region tracking · 7b24d861
      Davidlohr Bueso authored
      There is a race condition if we map a same file on different processes.
      Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
      When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only
      mmap_sem (exclusively).  This doesn't prevent other tasks from modifying
      the region structure, so it can be modified by two processes
      concurrently.
      
      To solve this, introduce a spinlock to resv_map and make region
      manipulation function grab it before they do actual work.
      
      [davidlohr@hp.com: updated changelog]
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b24d861
    • Joonsoo Kim's avatar
      mm, hugetlb: improve, cleanup resv_map parameters · 1406ec9b
      Joonsoo Kim authored
      To change a protection method for region tracking to find grained one,
      we pass the resv_map, instead of list_head, to region manipulation
      functions.
      
      This doesn't introduce any functional change, and it is just for
      preparing a next step.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1406ec9b
    • Joonsoo Kim's avatar
      mm, hugetlb: unify region structure handling · 9119a41e
      Joonsoo Kim authored
      Currently, to track reserved and allocated regions, we use two different
      ways, depending on the mapping.  For MAP_SHARED, we use
      address_mapping's private_list and, while for MAP_PRIVATE, we use a
      resv_map.
      
      Now, we are preparing to change a coarse grained lock which protect a
      region structure to fine grained lock, and this difference hinder it.
      So, before changing it, unify region structure handling, consistently
      using a resv_map regardless of the kind of mapping.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9119a41e
    • Mel Gorman's avatar
      mm: optimize put_mems_allowed() usage · d26914d1
      Mel Gorman authored
      Since put_mems_allowed() is strictly optional, its a seqcount retry, we
      don't need to evaluate the function if the allocation was in fact
      successful, saving a smp_rmb some loads and comparisons on some relative
      fast-paths.
      
      Since the naming, get/put_mems_allowed() does suggest a mandatory
      pairing, rename the interface, as suggested by Mel, to resemble the
      seqcount interface.
      
      This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
      where it is important to note that the return value of the latter call
      is inverted from its previous incarnation.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d26914d1
    • David Rientjes's avatar
      mm, compaction: ignore pageblock skip when manually invoking compaction · 91ca9186
      David Rientjes authored
      The cached pageblock hint should be ignored when triggering compaction
      through /proc/sys/vm/compact_memory so all eligible memory is isolated.
      Manually invoking compaction is known to be expensive, there's no need
      to skip pageblocks based on heuristics (mainly for debugging).
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      91ca9186
    • Vladimir Davydov's avatar
      mm: vmscan: remove shrink_control arg from do_try_to_free_pages() · 3115cd91
      Vladimir Davydov authored
      There is no need passing on a shrink_control struct from
      try_to_free_pages() and friends to do_try_to_free_pages() and then to
      shrink_zones(), because it is only used in shrink_zones() and the only
      field initialized on the top level is gfp_mask, which is always equal to
      scan_control.gfp_mask.  So let's move shrink_control initialization to
      shrink_zones().
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3115cd91
    • Vladimir Davydov's avatar
      mm: vmscan: move call to shrink_slab() to shrink_zones() · 65ec02cb
      Vladimir Davydov authored
      This reduces the indentation level of do_try_to_free_pages() and removes
      extra loop over all eligible zones counting the number of on-LRU pages.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: default avatarGlauber Costa <glommer@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      65ec02cb
    • Vladimir Davydov's avatar
      mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim · 99120b77
      Vladimir Davydov authored
      When direct reclaim is executed by a process bound to a set of NUMA
      nodes, we should scan only those nodes when possible, but currently we
      will scan kmem from all online nodes even if the kmem shrinker is NUMA
      aware.  That said, binding a process to a particular NUMA node won't
      prevent it from shrinking inode/dentry caches from other nodes, which is
      not good.  Fix this.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99120b77
    • Ben Zhang's avatar
      kernel/watchdog.c: touch_nmi_watchdog should only touch local cpu not every one · 62572e29
      Ben Zhang authored
      I ran into a scenario where while one cpu was stuck and should have
      panic'd because of the NMI watchdog, it didn't.  The reason was another
      cpu was spewing stack dumps on to the console.  Upon investigation, I
      noticed that when writing to the console and also when dumping the
      stack, the watchdog is touched.
      
      This causes all the cpus to reset their NMI watchdog flags and the
      'stuck' cpu just spins forever.
      
      This change causes the semantics of touch_nmi_watchdog to be changed
      slightly.  Previously, I accidentally changed the semantics and we
      noticed there was a codepath in which touch_nmi_watchdog could be
      touched from a preemtible area.  That caused a BUG() to happen when
      CONFIG_DEBUG_PREEMPT was enabled.  I believe it was the acpi code.
      
      My attempt here re-introduces the change to have the
      touch_nmi_watchdog() code only touch the local cpu instead of all of the
      cpus.  But instead of using __get_cpu_var(), I use the
      __raw_get_cpu_var() version.
      
      This avoids the preemption problem.  However my reasoning wasn't because
      I was trying to be lazy.  Instead I rationalized it as, well if
      preemption is enabled then interrupts should be enabled to and the NMI
      watchdog will have no reason to trigger.  So it won't matter if the
      wrong cpu is touched because the percpu interrupt counters the NMI
      watchdog uses should still be incrementing.
      
      Don said:
      
      : I'm ok with this patch, though it does alter the behaviour of how
      : touch_nmi_watchdog works.  For the most part I don't think most callers
      : need to touch all of the watchdogs (on each cpu).  Perhaps a corner case
      : will pop up (the scheduler??  to mimic touch_all_softlockup_watchdogs() ).
      :
      : But this does address an issue where if a system is locked up and one cpu
      : is spewing out useful debug messages (or error messages), the hard lockup
      : will fail to go off.  We have seen this on RHEL also.
      Signed-off-by: default avatarDon Zickus <dzickus@redhat.com>
      Signed-off-by: default avatarBen Zhang <benzh@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62572e29
    • Dan Carpenter's avatar
      fs/direct-io.c: remove some left over checks · 45d4f855
      Dan Carpenter authored
      We know that "ret > 0" is true here.  These tests were left over from
      commit 02afc27f ('direct-io: Handle O_(D)SYNC AIO') and aren't
      needed any more.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45d4f855
    • Gu Zheng's avatar
      fs/direct-io.c: remove redundant comparison · 2b665e27
      Gu Zheng authored
      The return value of bio_get_nr_vecs() cannot be bigger than
      BIO_MAX_PAGES, so we can remove redundant the comparison between
      nr_pages and BIO_MAX_PAGES.
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b665e27
    • Wengang Wang's avatar
      ocfs2: pass "new" parameter to ocfs2_init_xattr_bucket · 9c339255
      Wengang Wang authored
      This patch fixes the following crash:
      
        kernel BUG at fs/ocfs2/uptodate.c:530!
        Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F)
         ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
        CPU 5
        Pid: 21303, comm: xattr-test Tainted: GF       W    3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
        RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
        Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
        Call Trace:
          ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
          ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
          ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
          ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
          ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
          ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
          __ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
          ocfs2_xattr_set+0x691/0x880 [ocfs2]
          ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
          generic_setxattr+0x96/0xa0
          __vfs_setxattr_noperm+0x7b/0x170
          vfs_setxattr+0xbc/0xc0
          setxattr+0xde/0x230
          sys_fsetxattr+0xc6/0xf0
          system_call_fastpath+0x16/0x1b
        Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 <0f> 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
        RIP  ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
      
      It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():
      
          void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
                                             struct buffer_head *bh)
          {
                /* This should definitely *not* exist in our cache */
                if (ocfs2_buffer_cached(ci, bh))
                        printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
                BUG_ON(ocfs2_buffer_cached(ci, bh));
      
                set_buffer_uptodate(bh);
      
                ocfs2_metadata_cache_io_lock(ci);
                ocfs2_set_buffer_uptodate(ci, bh);
                ocfs2_metadata_cache_io_unlock(ci);
          }
      
      The problem here is:
      
      We cached a block, but the buffer_head got reused.  When we are to pick
      up this block again, a new buffer_head created with UPTODATE flag
      cleared.  ocfs2_buffer_uptodate() returned false since no UPTODATE is
      set on the buffer_head.  so we set this block to cache as a NEW block,
      then it failed at asserting block is not in cache.
      
      The fix is to add a new parameter indicating the bucket is a new
      allocated or not to ocfs2_init_xattr_bucket().
      ocfs2_init_xattr_bucket() assert block not cached accordingly.
      Signed-off-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joe Jin <joe.jin@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c339255
    • jiangyiwen's avatar
      ocfs2: avoid system inode ref confusion by adding mutex lock · 43b10a20
      jiangyiwen authored
      The following case may lead to the same system inode ref in confusion.
      
      A thread                            B thread
      ocfs2_get_system_file_inode
      ->get_local_system_inode
      ->_ocfs2_get_system_file_inode
                                          because of *arr == NULL,
                                          ocfs2_get_system_file_inode
                                          ->get_local_system_inode
                                          ->_ocfs2_get_system_file_inode
      gets first ref thru
      _ocfs2_get_system_file_inode,
      gets second ref thru igrab and
      set *arr = inode
                                          at the moment, B thread also gets
                                          two refs, so lead to one more
                                          inode ref.
      
      So add mutex lock to avoid multi thread set two inode ref once at the
      same time.
      Signed-off-by: default avatarjiangyiwen <jiangyiwen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43b10a20
    • jiangyiwen's avatar
      ocfs2: iput inode alloc when failed locally · 7dc3e839
      jiangyiwen authored
      In ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit() func, after
      calls ocfs2_get_system_file_inode() to get inode ref, if calls
      ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() failed, we should
      iput inode alloc to avoid leaking the inode.
      Signed-off-by: default avatarjiangyiwen <jiangyiwen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7dc3e839
    • Tariq Saeed's avatar
      ocfs2/o2net: o2net_listen_data_ready should do nothing if socket state is not TCP_LISTEN · da8ded40
      Tariq Saeed authored
      Orabug: 17330860
      
      When accepting an incomming connection o2net_accept_one clones a child
      data socket from the parent listening socket.  It then proceeds to setup
      the child with callback o2net_data_ready() and sk_user_data to NULL.  If
      data arrives in this window, o2net_listen_data_ready will be called with
      some non-deterministic value in sk_user_data (not inherited).  We panic
      when we page fault on sk_user_data -- in parent it is
      sock_def_readable().
      
      The fix is to recognize that this is a data socket being set up by
      looking at the socket state and do nothing.
      Signed-off-by: default avatarTariq Saseed <tariq.x.saeed@oracle.com>
      Signed-off-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da8ded40
    • Younger Liu's avatar
      ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed · db66c715
      Younger Liu authored
      After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(),
      if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that
      some space may be lost.
      
      So, roll back alloc_dinode counts when ocfs2_block_group_set_bits()
      failed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarYounger Liu <younger.liucn@gmail.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db66c715
    • Wengang Wang's avatar
      ocfs2: flock: drop cross-node lock when failed locally · e228f643
      Wengang Wang authored
      ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node clock and
      then call flock_lock_file_wait() to compete with local processes.  In
      case flock_lock_file_wait() failed, say -ENOMEM, clean up work is not
      done.  This patch adds the cleanup --drop the cross-node lock which was
      just granted.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e228f643
    • Darrick J. Wong's avatar
      ocfs2: call ocfs2_update_inode_fsync_trans when updating any inode · 6fdb702d
      Darrick J. Wong authored
      Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch
      an inode in a given transaction.  This is a follow-on to the previous
      patch to reduce lock contention and deadlocking during an fsync
      operation.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Wengang <wen.gang.wang@oracle.com>
      Cc: Greg Marsden <greg.marsden@oracle.com>
      Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6fdb702d
    • Tetsuo Handa's avatar
      ocfs2: fix panic on kfree(xattr->name) · f81c2015
      Tetsuo Handa authored
      Commit 9548906b ('xattr: Constify ->name member of "struct xattr"')
      missed that ocfs2 is calling kfree(xattr->name).  As a result, kernel
      panic occurs upon calling kfree(xattr->name) because xattr->name refers
      static constant names.  This patch removes kfree(xattr->name) from
      ocfs2_mknod() and ocfs2_symlink().
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: default avatarTariq Saeed <tariq.x.saeed@oracle.com>
      Tested-by: default avatarTariq Saeed <tariq.x.saeed@oracle.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f81c2015
    • alex chen's avatar
      ocfs2: do not put bh when buffer_uptodate failed · f7cf4f5b
      alex chen authored
      Do not put bh when buffer_uptodate failed in ocfs2_write_block and
      ocfs2_write_super_or_backup, because it will put bh in b_end_io.
      Otherwise it will hit a warning "VFS: brelse: Trying to free free
      buffer".
      Signed-off-by: default avatarAlex Chen <alex.chen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Acked-by: default avatarJoel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7cf4f5b
    • Xue jiufei's avatar
      ocfs2: __ocfs2_mknod_locked should return error when ocfs2_create_new_inode_locks() failed · 466e68c4
      Xue jiufei authored
      When ocfs2_create_new_inode_locks() return error, inode open lock may
      not be obtainted for this inode.  So other nodes can remove this file
      and free dinode when inode still remain in memory on this node, which is
      not correct and may trigger BUG.  So __ocfs2_mknod_locked should return
      error when ocfs2_create_new_inode_locks() failed.
      
                    Node_1                              Node_2
      create fileA, call ocfs2_mknod()
        -> ocfs2_get_init_inode(), allocate inodeA
        -> ocfs2_claim_new_inode(), claim dinode(dinodeA)
        -> call ocfs2_create_new_inode_locks(),
           create open lock failed, return error
        -> __ocfs2_mknod_locked return success
      
                                                      unlink fileA
                                                      try open lock succeed,
                                                      and free dinodeA
      
      create another file, call ocfs2_mknod()
        -> ocfs2_get_init_inode(), allocate inodeB
        -> ocfs2_claim_new_inode(), as Node_2 had freed dinodeA,
           so claim dinodeA and update generation for dinodeA
      
      call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
      to free inodeA, and finally triggers BUG
      on(inode->i_generation != le32_to_cpu(fe->i_generation))
      in function ocfs2_inode_lock_update().
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      466e68c4
    • Tariq Saeed's avatar
      ocfs2: allow for more than one data extent when creating xattr · 3ed2be71
      Tariq Saeed authored
      Orabug: 18108070
      
      ocfs2_xattr_extend_allocation() hits panic when creating xattr during
      data extent alloc phase.  The problem occurs if due to local alloc
      fragmentation, clusters are spread over multiple extents.  In this case
      ocfs2_add_clusters_in_btree() finds no space to store more than one
      extent record and therefore fails returning RESTART_META.  The situation
      is anticipated for xattr update case but not xattr create case.  This
      fix simply ports that code to create case.
      Signed-off-by: default avatarTariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ed2be71
    • Zhonghua Guo's avatar
      ocfs2: fix deadlock risk when kmalloc failed in dlm_query_region_handler · a35ad97c
      Zhonghua Guo authored
      In dlm_query_region_handler(), once kmalloc failed, it will unlock
      dlm_domain_lock without lock first, then deadlock happens.
      Signed-off-by: default avatarZhonghua Guo <guozhonghua@h3c.com>
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Tested-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a35ad97c
    • Jensen's avatar
      ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END · c8d888d9
      Jensen authored
      llseek requires ocfs2 inode lock for updating the file size in SEEK_END.
      because the file size maybe update on another node.
      
      This bug can be reproduce the following scenario: at first, we dd a test
      fileA, the file size is 10k.
      
      on NodeA:
      ---------
       1) open the test fileA, lseek the end of file. and print the position.
       2) close the test fileA
      
      on NodeB:
       1) open the test fileA, append the 5k data to test FileA.
       2) lseek the end of file. and print the position.
       3) close file.
      
      At first we run the test program1 on NodeA , the result is 10k.  And
      then run the test program2 on NodeB, the result is 15k.  At last, we run
      the test program1 on NodeA again, the result is 10k.
      
      After applying this patch the three step result is 15k.
      
      test result: 1000000 times lseek call;
      index        lseek with inode lock (unit:us)                lseek without inode lock (unit:us)
        1                   1168162                                    555383
        2                   1168011                                    549504
        3                   1170538                                    549396
        4                   1170375                                    551685
        5                   1170444                                    556719
        6                   1174364                                    555307
        7                   1163294                                    551552
        8                   1170080                                    549350
        9                   1162464                                    553700
       10                   1165441                                    552594
       avg                  1168317                                    552519
      
      avg with lock - avg without lock = 615798
      (avg with lock - avg without lock)/1000000=0.615798 us
      Signed-off-by: default avatarJensen <shencanquan@huawei.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Acked-by: default avatarJoel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8d888d9
    • Joseph Qi's avatar
      ocfs2: fix type conversion risk when get cluster attributes · 41b63efb
      Joseph Qi authored
      In o2nm_cluster, cl_idle_timeout_ms, cl_keepalive_delay_ms, as well as
      cl_reconnect_delay_ms, are defined as type of unsigned int.  So we
      should also use unsigned int in the helper functions.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41b63efb
    • Goldwyn Rodrigues's avatar
      ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock · 8ed6b237
      Goldwyn Rodrigues authored
      The following patches are reverted in this patch because these patches
      caused performance regression in the remote unlink() calls.
      
        ea455f8a - ocfs2: Push out dropping of dentry lock to ocfs2_wq
        f7b1aa69 - ocfs2: Fix deadlock on umount
        5fd13189 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
      
      Previous patches in this series removed the possible deadlocks from
      downconvert thread so the above patches shouldn't be needed anymore.
      
      The regression is caused because these patches delay the iput() in case
      of dentry unlocks.  This also delays the unlocking of the open lockres.
      The open lockresource is required to test if the inode can be wiped from
      disk or not.  When the deleting node does not get the open lock, it
      marks it as orphan (even though it is not in use by another
      node/process) and causes a journal checkpoint.  This delays operations
      following the inode eviction.  This also moves the inode to the orphaned
      inode which further causes more I/O and a lot of unneccessary orphans.
      
      The following script can be used to generate the load causing issues:
      
        declare -a create
        declare -a remove
        declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
        unique="`mktemp -u XXXXX`"
        script="/tmp/idontknow-${unique}.sh"
        cat <<EOF > "${script}"
        for n in {1..8}; do mkdir -p test/dir\${n}
          eval touch test/dir\${n}/foo{1.."\$1"}
        done
        EOF
        chmod 700 "${script}"
      
        function fcreate ()
        {
          exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
        }
      
        function fremove ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
        }
      
        function fcp ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
        }
      
        echo -------------------------------------------------
        echo "| # files | create #s | copy #s | remove #s |"
        echo -------------------------------------------------
        for ((x=0; x < ${#iterations[*]} ; x++)) do
          create[$x]="`fcreate ${iterations[$x]}`"
          copy[$x]="`fcp ${iterations[$x]}`"
          remove[$x]="`fremove`"
          printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
        done
        rm "${script}"
        echo "------------------------"
      Signed-off-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ed6b237
    • Jan Kara's avatar
      ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread · 84d86f83
      Jan Kara authored
      If we are dropping last inode reference from downconvert thread, we will
      end up calling ocfs2_mark_lockres_freeing() which can block if the lock
      we are freeing is queued thus creating an A-A deadlock.  Luckily, since
      we are the downconvert thread, we can immediately dequeue the lock and
      thus avoid waiting in this case.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84d86f83
    • Jan Kara's avatar
      ocfs2: implement delayed dropping of last dquot reference · e3a767b6
      Jan Kara authored
      We cannot drop last dquot reference from downconvert thread as that
      creates the following deadlock:
      
      NODE 1                                  NODE2
      holds dentry lock for 'foo'
      holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                              dquot_initialize(bar)
                                                ocfs2_dquot_acquire()
                                                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                  ...
      downconvert thread (triggered from another
      node or a different process from NODE2)
        ocfs2_dentry_post_unlock()
          ...
          iput(foo)
            ocfs2_evict_inode(foo)
              ocfs2_clear_inode(foo)
                dquot_drop(inode)
                  ...
      	    ocfs2_dquot_release()
                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                     - blocks
                                                  finds we need more space in
                                                  quota file
                                                  ...
                                                  ocfs2_extend_no_holes()
                                                    ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                      - deadlocks waiting for
                                                        downconvert thread
      
      We solve the problem by postponing dropping of the last dquot reference to
      a workqueue if it happens from the downconvert thread.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3a767b6
    • Jan Kara's avatar
      quota: provide function to grab quota structure reference · 9f985cb6
      Jan Kara authored
      Provide dqgrab() function to get quota structure reference when we are
      sure it already has at least one active reference.  Make use of this
      function inside quota code.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f985cb6
    • Jan Kara's avatar
      ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later · bd62ad7a
      Jan Kara authored
      Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
      verify inode is actually a sane one to delete.  We certainly don't want
      to initialize quota for system inodes etc.  This also avoids calling
      into quota code from downconvert thread.
      
      Add more details into the comment why bailing out from
      ocfs2_delete_inode() when we are in downconvert thread is OK.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bd62ad7a
    • Jan Kara's avatar
      ocfs2: remove OCFS2_INODE_SKIP_DELETE flag · 7bf619c1
      Jan Kara authored
      The flag was never set, delete it.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7bf619c1
    • Goldwyn Rodrigues's avatar
      ocfs2: add dlm_recover_callback_support in sysfs · 765aabbb
      Goldwyn Rodrigues authored
      This is a part of the nocontrold feature which was incorporated sometime
      back.
      
      This is required for backward compatibility of the tools, specifically
      the scenario where the tools with recovery callback is used with a
      kernel not using the recovery callbacks (older kernel + newer tools).
      The tools look for this file to understand if the kernel supports DLM
      recovery callbacks.
      
      For kernels which support recovery callbacks but will miss this patch,
      ocfs2 will continue to use the older API and would still be able to
      mount the filesystem.
      
      [akpm@linux-foundation.org: simplify]
      [sfr@canb.auug.org.au: VERIFY_OCTAL_PERMISSIONS fix up]
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      765aabbb
    • Junxiao Bi's avatar
      ocfs2: dlm: fix recovery hung · ded2cf71
      Junxiao Bi authored
      There is a race window in dlm_do_recovery() between dlm_remaster_locks()
      and dlm_reset_recovery() when the recovery master nearly finish the
      recovery process for a dead node.  After the master sends FINALIZE_RECO
      message in dlm_remaster_locks(), another node may become the recovery
      master for another dead node, and then send the BEGIN_RECO message to
      all the nodes included the old master, in the handler of this message
      dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
      dlm->reco.new_master will be set to the second dead node and the new
      master, then in dlm_reset_recovery(), these two variables will be reset
      to default value.  This will cause new recovery master can not finish
      the recovery process and hung, at last the whole cluster will hung for
      recovery.
      
      old recovery master:                                 new recovery master:
      dlm_remaster_locks()
                                                        become recovery master for
                                                        another dead node.
                                                        dlm_send_begin_reco_message()
      dlm_begin_reco_handler()
      {
       if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
        return -EAGAIN;
       }
       dlm_set_reco_master(dlm, br->node_idx);
       dlm_set_reco_dead_node(dlm, br->dead_node);
      }
      dlm_reset_recovery()
      {
       dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
       dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
      }
                                                        will hang in dlm_remaster_locks() for
                                                        request dlm locks info
      
      Before send FINALIZE_RECO message, recovery master should set
      DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
      this can break the race windows as the BEGIN_RECO messages will not be
      handled before DLM_RECO_STATE_FINALIZE flag is cleared.
      
      A similar race may happen between new recovery master and normal node
      which is in dlm_finalize_reco_handler(), also fix it.
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Reviewed-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ded2cf71
    • Junxiao Bi's avatar
      ocfs2: dlm: fix lock migration crash · 34aa8dac
      Junxiao Bi authored
      This issue was introduced by commit 800deef3 ("ocfs2: use
      list_for_each_entry where benefical") in 2007 where it replaced
      list_for_each with list_for_each_entry.  The variable "lock" will point
      to invalid data if "tmpq" list is empty and a panic will be triggered
      due to this.  Sunil advised reverting it back, but the old version was
      also not right.  At the end of the outer for loop, that
      list_for_each_entry will also set "lock" to an invalid data, then in the
      next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
      data and cause the panic.  So reverting the list_for_each back and reset
      "lock" to NULL to fix this issue.
      
      Another concern is that this seemes can not happen because the "tmpq"
      list should not be empty.  Let me describe how.
      
      old lock resource owner(node 1):                                  migratation target(node 2):
      image there's lockres with a EX lock from node 2 in
      granted list, a NR lock from node x with convert_type
      EX in converting list.
      dlm_empty_lockres() {
       dlm_pick_migration_target() {
         pick node 2 as target as its lock is the first one
         in granted list.
       }
       dlm_migrate_lockres() {
         dlm_mark_lockres_migrating() {
           res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
           wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
      	 //after the above code, we can not dirty lockres any more,
           // so dlm_thread shuffle list will not run
                                                                         downconvert lock from EX to NR
                                                                         upconvert lock from NR to EX
      <<< migration may schedule out here, then
      <<< node 2 send down convert request to convert type from EX to
      <<< NR, then send up convert request to convert type from NR to
      <<< EX, at this time, lockres granted list is empty, and two locks
      <<< in the converting list, node x up convert lock followed by
      <<< node 2 up convert lock.
      
      	 // will set lockres RES_MIGRATING flag, the following
      	 // lock/unlock can not run
           dlm_lockres_release_ast(dlm, res);
         }
      
         dlm_send_one_lockres()
                                                                       dlm_process_recovery_data()
                                                                         for (i=0; i<mres->num_locks; i++)
                                                                           if (ml->node == dlm->node_num)
                                                                             for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                                              list_for_each_entry(lock, tmpq, list)
                                                                              if (lock) break; <<< lock is invalid as grant list is empty.
                                                                             }
                                                                             if (lock->ml.node != ml->node)
                                                                               BUG() >>> crash here
       }
      
      I see the above locks status from a vmcore of our internal bug.
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34aa8dac
    • Darrick J. Wong's avatar
      ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file · 2931cdcb
      Darrick J. Wong authored
      Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
      transaction to complete.  This isn't terribly efficient, since sync_file
      really only needs to wait for the last transaction involving that inode
      to complete, and this doesn't require i_mutex.
      
      Therefore, implement the necessary bits to track the newest tid
      associated with an inode, and teach sync_file to wait for that instead
      of waiting for everything in the journal to commit.  Furthermore, only
      issue the flush request to the drive if jbd2 hasn't already done so.
      
      This also eliminates the deadlock between ocfs2_file_aio_write() and
      ocfs2_sync_file().  aio_write takes i_mutex then calls
      ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
      However, if that dio completion involves calling fsync, then we can get
      into trouble when some ocfs2_sync_file tries to take i_mutex.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2931cdcb