1. 13 Dec, 2012 20 commits
    • procfs: use N_MEMORY instead N_HIGH_MEMORY · 4ff1b2c2
      Lai Jiangshan authored
      N_HIGH_MEMORY stands for the nodes that have normal or high memory.
      N_MEMORY stands for the nodes that have any memory.
      
      The code here needs to handle the nodes which have memory, so we should
      use N_MEMORY instead.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: use N_MEMORY instead N_HIGH_MEMORY · 38d7bee9
      Lai Jiangshan authored
      N_HIGH_MEMORY stands for the nodes that have normal or high memory.
      N_MEMORY stands for the nodes that have any memory.
      
      The code here needs to handle the nodes which have memory, so we should
      use N_MEMORY instead.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: node_states: introduce N_MEMORY · 8219fc48
      Lai Jiangshan authored
      We have N_NORMAL_MEMORY to stand for the nodes that have normal memory
      with zone_type <= ZONE_NORMAL.
      
      And we have N_HIGH_MEMORY to stand for the nodes that have normal or
      high memory.
      
      But we don't have any name to stand for the nodes that have *any* memory.
      
      And we have N_CPU but no N_MEMORY.
      
      Current code reuses N_HIGH_MEMORY for this purpose because any node
      that has memory must currently have high memory or normal memory.
      
      A)	But this reuse is bad for *readability*, because the name
      	N_HIGH_MEMORY just stands for high or normal:
      
      A.example 1)
      	mem_cgroup_nr_lru_pages():
      		for_each_node_state(nid, N_HIGH_MEMORY)
      
      	The reader will be confused (why does this function count only high or
      	normal memory nodes? does it count ZONE_MOVABLE's lru pages?)
      	until someone tells them that N_HIGH_MEMORY is reused to stand for
      	nodes that have any memory.
      
      A.cont) If we introduce N_MEMORY, we can reduce this confusion
      	AND make the code clearer:
      
      A.example 2) mm/page_cgroup.c uses N_HIGH_MEMORY twice:
      
      	One is in page_cgroup_init(void):
      		for_each_node_state(nid, N_HIGH_MEMORY) {
      
      	It means that if the node has memory, we will allocate a page_cgroup
      	map for the node. We should use N_MEMORY here instead for clarity.
      
      	The second use is in alloc_page_cgroup():
      		if (node_state(nid, N_HIGH_MEMORY))
      			addr = vzalloc_node(size, nid);
      
      	It means the node has high or normal memory that the kernel can
      	allocate from. We should keep N_HIGH_MEMORY here, and it will be better
      	if the "any memory" semantic of N_HIGH_MEMORY is removed.
      
      B)	This reuse becomes outdated if we introduce MOVABLE-dedicated nodes.
      	A MOVABLE-dedicated node should appear in neither
      	node_states[N_HIGH_MEMORY] nor node_states[N_NORMAL_MEMORY],
      	because a MOVABLE-dedicated node has no high or normal memory.
      
      	On x86_64, N_HIGH_MEMORY=N_NORMAL_MEMORY, so if a MOVABLE-dedicated node
      	is in node_states[N_HIGH_MEMORY], it is also in
      	node_states[N_NORMAL_MEMORY], which breaks SLUB.
      
      	SLUB uses
      		for_each_node_state(nid, N_NORMAL_MEMORY)
      	and would create a kmem_cache_node for the MOVABLE-dedicated node,
      	which causes problems.
      
      In short, we need N_MEMORY.  We just introduce it as an alias of
      N_HIGH_MEMORY and fix all improper usages of N_HIGH_MEMORY in later
      patches.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
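The distinction argued for in the N_MEMORY commit can be illustrated with a small userspace model. This is only a sketch, not kernel code: the bitmask layout, the `zone_kind` flags and `node_set_zones()` are assumptions made for illustration, and it models the intended end state (N_MEMORY as its own state) rather than the initial alias introduced by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; the names mirror the kernel's, the code does not. */
enum node_state { N_NORMAL_MEMORY, N_HIGH_MEMORY, N_MEMORY, NR_NODE_STATES };
enum zone_kind  { ZONE_NORMAL = 1 << 0, ZONE_HIGHMEM = 1 << 1, ZONE_MOVABLE = 1 << 2 };

/* One bitmask of node ids per state, loosely like node_states[] in the kernel. */
static unsigned int node_states[NR_NODE_STATES];

/* Record which states a node belongs to, given the zones it has memory in. */
static void node_set_zones(int nid, unsigned int zones)
{
	if (zones & ZONE_NORMAL)
		node_states[N_NORMAL_MEMORY] |= 1u << nid;
	if (zones & (ZONE_NORMAL | ZONE_HIGHMEM))
		node_states[N_HIGH_MEMORY] |= 1u << nid;
	if (zones)	/* any memory at all, including MOVABLE-only nodes */
		node_states[N_MEMORY] |= 1u << nid;
}

static bool node_state(int nid, enum node_state state)
{
	return node_states[state] & (1u << nid);
}
```

With this model, a node that only has ZONE_MOVABLE memory is reported by N_MEMORY but by neither N_HIGH_MEMORY nor N_NORMAL_MEMORY, which is the case B) above is concerned with.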
    • mm: use migrate_prep() instead of migrate_prep_local() · be49a6e1
      Marek Szyprowski authored
      __alloc_contig_migrate_range() should use all possible ways to get all the
      pages migrated from the given memory range, so pruning per-cpu lru lists
      for all CPUs is required, regardless of the cost of such an operation.
      Otherwise some pages stuck on a per-cpu lru list might be missed by the
      migration procedure, causing the contiguous allocation to fail.
      Reported-by: SeongHwan Yoon <sunghwan.yun@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: Fix compiler warning · c8bf2d8b
      Thierry Reding authored
      compact_capture_page() is only used if compaction is enabled so it should
      be moved into the corresponding #ifdef.
      Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: avoid race on multiple parallel page faults to the same page · 3ea41e62
      Kirill A. Shutemov authored
      pmd value is stable only with mm->page_table_lock taken. After taking
      the lock we need to check that nobody modified the pmd before changing it.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Bob Liu <lliubbo@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
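The "recheck after taking the lock" rule from the race fix can be sketched in userspace as follows. This is a minimal sketch under stated assumptions: the spinlock built on `atomic_flag`, the `pmd` variable and `install_pmd()` are hypothetical stand-ins for mm->page_table_lock and a real pmd entry, not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for mm->page_table_lock and one pmd entry. */
static atomic_flag page_table_lock = ATOMIC_FLAG_INIT;
static uint64_t pmd;

static void lock_page_table(void)
{
	while (atomic_flag_test_and_set_explicit(&page_table_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void unlock_page_table(void)
{
	atomic_flag_clear_explicit(&page_table_lock, memory_order_release);
}

/*
 * Install new_val only if the pmd still holds the value the caller sampled
 * before taking the lock. Returns false if another fault changed it first,
 * in which case the caller backs off instead of overwriting the entry.
 */
static bool install_pmd(uint64_t orig_val, uint64_t new_val)
{
	bool installed = false;

	lock_page_table();
	if (pmd == orig_val) {	/* the value is stable only under the lock */
		pmd = new_val;
		installed = true;
	}
	unlock_page_table();
	return installed;
}
```

Two parallel faults that both sampled an empty pmd would both call `install_pmd(0, ...)`; only the first succeeds, and the second sees the changed value and gives up.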
    • thp: introduce sysfs knob to disable huge zero page · 79da5407
      Kirill A. Shutemov authored
      By default the kernel tries to use the huge zero page on a read page
      fault.  It's possible to disable the huge zero page by writing 0, or to
      enable it again by writing 1:
      
      echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
      echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/use_zero_page
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
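The same toggle can be driven from a program instead of the shell. The helper below is a hypothetical sketch (`write_bool_knob()` is not an existing API); it takes the knob path as a parameter, so the path quoted in the commit message is not hard-coded and should be verified on the running kernel before use.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "0" or "1" to a sysfs-style boolean knob.
 * Returns 0 on success, -1 on failure (e.g. missing file or no permission).
 */
static int write_bool_knob(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
	if (fclose(f) != 0)	/* buffered write errors surface at close time */
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling `write_bool_knob(path, 0)` corresponds to the `echo 0 >...` line above, and `write_bool_knob(path, 1)` to the `echo 1 >...` line.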
    • thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events · d8a8e1f0
      Kirill A. Shutemov authored
      hzp_alloc is incremented every time a huge zero page is successfully
      	allocated. It includes allocations which were dropped due to a
      	race with another allocation. Note, it doesn't count every map
      	of the huge zero page, only its allocation.
      
      hzp_alloc_failed is incremented if the kernel fails to allocate a huge
      	zero page and falls back to using small pages.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
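Counters like these are typically read as "name value" lines, the layout /proc/vmstat uses. The lookup below is a sketch only: `vmstat_lookup()` is a hypothetical helper, and the counter names are the ones used in the commit message, so the names actually exported by a given kernel should be checked there.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a counter in "name value\n" text, the layout /proc/vmstat uses.
 * Returns the value, or -1 if the name is not present.
 */
static long long vmstat_lookup(const char *text, const char *name)
{
	char key[64];
	long long val;
	int consumed;

	/* %n reports how many characters this record used, so we can advance. */
	while (sscanf(text, "%63s %lld%n", key, &val, &consumed) == 2) {
		if (strcmp(key, name) == 0)
			return val;
		text += consumed;
	}
	return -1;
}
```

For example, given a buffer holding the contents of /proc/vmstat, `vmstat_lookup(buf, "hzp_alloc_failed")` would return how often the kernel fell back to small pages, assuming that name is present.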
    • thp: implement refcounting for huge zero page · 97ae1749
      Kirill A. Shutemov authored
      H. Peter Anvin doesn't like a huge zero page which sticks in memory
      forever after the first allocation.  Here's an implementation of lockless
      refcounting for the huge zero page.
      
      We have two basic primitives: {get,put}_huge_zero_page(). They
      manipulate the reference counter.
      
      If the counter is 0, get_huge_zero_page() allocates a new huge page and
      takes two references: one for the caller and one for the shrinker.  We
      free the page only in the shrinker callback if the counter is 1 (only
      the shrinker has the reference).
      
      put_huge_zero_page() only decrements the counter.  The counter is never
      zero in put_huge_zero_page() since the shrinker holds one reference.
      
      Freeing the huge zero page in the shrinker callback helps to avoid
      frequent allocate-free cycles.
      
      Refcounting has a cost.  On a 4-socket machine I observe a ~1% slowdown
      on parallel (40 processes) read page faulting compared to lazy huge page
      allocation.  I think it's pretty reasonable for a synthetic benchmark.
      
      [lliubbo@gmail.com: fix mismerge]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
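The refcounting scheme described in this commit can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel implementation: `calloc()` stands in for the huge page allocator, `shrink_huge_zero_page()` stands in for the shrinker callback, and the variable names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define HZP_SIZE (2u * 1024 * 1024)	/* 2M, as on x86-64 */

static atomic_int hzp_refcount;		/* 0 means "no page allocated" */
static _Atomic(void *) hzp_page;

/* Take a reference, allocating the page on the 0 -> 2 transition. */
static bool get_huge_zero_page(void)
{
	for (;;) {
		int old = atomic_load(&hzp_refcount);

		/* Fast path: increment unless the counter is zero. */
		while (old != 0)
			if (atomic_compare_exchange_weak(&hzp_refcount, &old, old + 1))
				return true;

		void *page = calloc(1, HZP_SIZE);
		if (!page)
			return false;

		void *expected = NULL;
		if (!atomic_compare_exchange_strong(&hzp_page, &expected, page)) {
			free(page);	/* lost the race; retry the fast path */
			continue;
		}
		/* Two references: one for the caller, one for the shrinker. */
		atomic_store(&hzp_refcount, 2);
		return true;
	}
}

/* Drop a reference; never reaches zero because the shrinker holds one. */
static void put_huge_zero_page(void)
{
	int old = atomic_fetch_sub(&hzp_refcount, 1);

	assert(old > 1);
}

/* Shrinker stand-in: free only if the shrinker's reference is the last one. */
static bool shrink_huge_zero_page(void)
{
	int expected = 1;

	if (!atomic_compare_exchange_strong(&hzp_refcount, &expected, 0))
		return false;
	free(atomic_exchange(&hzp_page, NULL));
	return true;
}
```

Because the page is freed only when the shrinker runs and finds the counter at 1, a workload that repeatedly maps and unmaps the page does not trigger an allocate-free cycle each time.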
    • thp: lazy huge zero page allocation · 78ca0e67
      Kirill A. Shutemov authored
      Instead of allocating the huge zero page in hugepage_init(), we can
      postpone it until the huge zero page is first mapped. It saves memory if
      THP is not in use.
      
      cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
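The cmpxchg()-based initialization can be sketched with a C11 compare-and-exchange. This is a userspace illustration only: `publish_once()` is a hypothetical helper, and the variable merely borrows the name huge_zero_pfn from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uintptr_t huge_zero_pfn;	/* 0 means "not yet initialized" */

/*
 * Publish candidate as the shared value if nobody has done so yet.
 * Returns the value that ended up published, so a caller that lost the
 * race knows to discard its own candidate.
 */
static uintptr_t publish_once(uintptr_t candidate)
{
	uintptr_t expected = 0;

	if (atomic_compare_exchange_strong(&huge_zero_pfn, &expected, candidate))
		return candidate;
	return expected;	/* holds the winner's value after a failed exchange */
}
```

A caller whose return value differs from its candidate has lost the race and would free the page it allocated, so at most one huge zero page is ever published.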
    • thp: setup huge zero page on non-write page fault · 80371957
      Kirill A. Shutemov authored
      All code paths seem covered. Now we can map the huge zero page on a read
      page fault.
      
      We set it up in do_huge_pmd_anonymous_page() if the area around the fault
      address is suitable for THP and we've got a read page fault.
      
      If we fail to set up the huge zero page (ENOMEM) we fall back to
      handle_pte_fault() as we normally do in THP.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: implement splitting pmd for huge zero page · c5a647d0
      Kirill A. Shutemov authored
      We can't split the huge zero page itself (and it's a bug if we try), but
      we can split the pmd which points to it.
      
      On splitting the pmd we create a table with all ptes set to the normal
      zero page.
      
      [akpm@linux-foundation.org: fix build error]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
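The result of splitting a huge zero pmd can be pictured with a toy page table. This is only a sketch: `struct pte_table`, `ZERO_PAGE_PFN` and `split_huge_zero_pmd()` are hypothetical, and a real pte carries protection bits in addition to a frame number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PMD_TABLE 512	/* 2M huge page / 4k small pages on x86-64 */

/* Hypothetical pfn of the normal (4k) zero page. */
#define ZERO_PAGE_PFN ((uint64_t)0x1234)

struct pte_table {
	uint64_t pte[PTRS_PER_PMD_TABLE];
};

/* Fill a freshly allocated table so every pte maps the normal zero page. */
static void split_huge_zero_pmd(struct pte_table *table)
{
	for (size_t i = 0; i < PTRS_PER_PMD_TABLE; i++)
		table->pte[i] = ZERO_PAGE_PFN;
}
```

After the split, reads through any of the 512 entries still return zeros, so the mapping behaves the same as before while the huge zero page itself is left untouched.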
    • thp: change split_huge_page_pmd() interface · e180377f
      Kirill A. Shutemov authored
      Pass vma instead of mm and add address parameter.
      
      In most cases we already have vma on the stack. We provide
      split_huge_page_pmd_mm() for the few cases when we have mm, but not vma.
      
      This change is preparation for the huge zero pmd splitting implementation.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: change_huge_pmd(): make sure we don't try to make a page writable · cad7f613
      Kirill A. Shutemov authored
      mprotect core never tries to make page writable using change_huge_pmd().
      Let's add an assert that the assumption is true.  It's important to be
      sure we will not make huge zero page writable.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: do_huge_pmd_wp_page(): handle huge zero page · 93b4796d
      Kirill A. Shutemov authored
      On write access to the huge zero page we allocate a new huge page and
      clear it.
      
      If ENOMEM, graceful fallback: we create a new pmd table and set the pte
      around the fault address to a newly allocated normal (4k) page.  All other
      ptes in the pmd are set to the normal zero page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
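The ENOMEM fallback for a write fault can be sketched as follows: one pte gets the newly allocated small page, the rest map the zero page. The constants, `ZERO_PAGE_PFN` and `wp_fallback()` are hypothetical and assume 4k pages within a 2M huge page, as on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_TABLE   512
#define SMALL_PAGE_SHIFT 12
#define HUGE_PAGE_MASK   (~(((uint64_t)1 << 21) - 1))
#define ZERO_PAGE_PFN    ((uint64_t)0x1234)	/* hypothetical */

/*
 * Fallback for a write fault at addr inside a huge-zero-page mapping:
 * the pte covering addr gets new_pfn, all others map the zero page.
 * Returns the index of the pte that received the new page.
 */
static size_t wp_fallback(uint64_t pte[PTES_PER_TABLE], uint64_t addr,
			  uint64_t new_pfn)
{
	/* Offset of addr within its 2M region, in units of 4k pages. */
	size_t fault_idx = (size_t)((addr & ~HUGE_PAGE_MASK) >> SMALL_PAGE_SHIFT);

	for (size_t i = 0; i < PTES_PER_TABLE; i++)
		pte[i] = (i == fault_idx) ? new_pfn : ZERO_PAGE_PFN;
	return fault_idx;
}
```

Only 4k of real memory is consumed by the fallback, which is why it can succeed when allocating a whole huge page has just failed.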
    • thp: copy_huge_pmd(): copy huge zero page · fc9fe822
      Kirill A. Shutemov authored
      It's easy to copy huge zero page. Just set destination pmd to huge zero
      page.
      
      It's safe to copy huge zero page since we have none yet :-p
      
      [rientjes@google.com: fix comment]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: zap_huge_pmd(): zap huge zero pmd · 479f0abb
      Kirill A. Shutemov authored
      We don't have a mapped page to zap in huge zero page case.  Let's just clear
      pmd and remove it from tlb.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: huge zero page: basic preparation · 4a6c1297
      Kirill A. Shutemov authored
      During testing I noticed big (up to 2.5 times) memory consumption overhead
      on some workloads (e.g.  ft.A from NPB) if THP is enabled.
      
      The main reason for that big difference is the lack of a zero page in the
      THP case: we have to allocate a real page on a read page fault.
      
      A program to demonstrate the issue:
      #include <assert.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      #define MB (1024*1024)
      
      int main(int argc, char **argv)
      {
              char *p;
              int i;
      
              /* 200M region, aligned to the 2M huge page size */
              if (posix_memalign((void **)&p, 2 * MB, 200 * MB))
                      return 1;
              /* read one byte per 4k page: read page faults only */
              for (i = 0; i < 200 * MB; i += 4096)
                      assert(p[i] == 0);
              pause();
              return 0;
      }
      
      With thp-never RSS is about 400k, but with thp-always it's 200M.  After
      the patchset thp-always RSS is 400k too.
      
      Design overview.
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
      zeros.  The way we allocate it changes over the course of the patchset:
      
      - [01/10] simplest way: hzp allocated at boot time in hugepage_init();
      - [09/10] lazy allocation on first use;
      - [10/10] lockless refcounting + shrinker-reclaimable hzp;
      
      We set it up in do_huge_pmd_anonymous_page() if the area around the fault
      address is suitable for THP and we've got a read page fault.  If we fail
      to set up hzp (ENOMEM) we fall back to handle_pte_fault() as we normally
      do in THP.
      
      On a wp fault to hzp we allocate real memory for the huge page and clear
      it.  If ENOMEM, graceful fallback: we create a new pmd table and set the
      pte around the fault address to a newly allocated normal (4k) page.  All
      other ptes in the pmd are set to the normal zero page.
      
      We cannot split hzp (and it's a bug if we try), but we can split the pmd
      which points to it.  On splitting the pmd we create a table with all ptes
      set to the normal zero page.
      
      ===
      
      At hpa's request I've tried an alternative approach for the hzp
      implementation (see the Virtual huge zero page patchset): a pmd table with
      all entries set to the zero page.  This way should be more cache friendly,
      but it increases TLB pressure.
      
      The problem with the virtual huge zero page: it requires per-arch
      enabling.  We need a way to mark that a pmd table has all ptes set to the
      zero page.
      
      Some numbers to compare the two implementations (on 4s Westmere-EX):
      
      Microbenchmark1
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 100; i++) {
                      assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                      asm volatile ("": : :"memory");
              }
      
      hzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
          76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
          36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
           1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
         134,355,715,816 instructions              #    1.75  insns per cycle
                                                   #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
          13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
               1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]
      
            32.413866442 seconds time elapsed                                          ( +-  0.13% )
      
      vhzp:
       Performance counter stats for './test_memcmp' (5 runs):
      
            30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                      38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                       0 CPU-migrations            #    0.000 K/sec
                   4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
          71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
          31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
             773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
         134,982,215,437 instructions              #    1.88  insns per cycle
                                                   #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
          13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
               1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]
      
            30.381324695 seconds time elapsed                                          ( +-  0.13% )
      
      Microbenchmark2
      ===============
      
      test:
              posix_memalign((void **)&p, 2 * MB, 8 * GB);
              for (i = 0; i < 1000; i++) {
                      char *_p = p;
                      while (_p < p+4*GB) {
                              assert(*_p == *(_p+4*GB));
                              _p += 4096;
                              asm volatile ("": : :"memory");
                      }
              }
      
      hzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
             3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                       9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
                   4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
           8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
           5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
           2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
           9,494,670,537 instructions              #    1.14  insns per cycle
                                                   #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
           2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
                 158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
           3,168,102,115 L1-dcache-loads           #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
           1,048,710,998 L1-dcache-misses          #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
           1,047,699,685 LLC-load                  #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
                   2,287 LLC-misses                #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
           3,166,187,367 dTLB-loads                #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
               4,266,538 dTLB-misses               #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]
      
             3.513339813 seconds time elapsed                                          ( +-  0.26% )
      
      vhzp:
       Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
      
            27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                      62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
                   4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
          64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
          61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
          56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
          10,033,724,846 instructions              #    0.15  insns per cycle
                                                   #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
           2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
               1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
           3,302,006,540 L1-dcache-loads           #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
             271,374,358 L1-dcache-misses          #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
              20,385,476 LLC-load                  #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
                  76,754 LLC-misses                #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
           3,309,927,290 dTLB-loads                #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
           2,098,967,427 dTLB-misses               #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]
      
            27.364448741 seconds time elapsed                                          ( +-  0.24% )
      
      ===
      
      I personally prefer the implementation presented in this patchset. It
      doesn't touch arch-specific code.
      
      This patch:
      
      Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
      zeros.
      
      For now let's allocate the page on hugepage_init().  We'll switch to lazy
      allocation later.
      
      We are not going to map the huge zero page until we can handle it properly
      on all code paths.
      
      is_huge_zero_{pfn,pmd}() functions will be used by following patches to
      check whether the pfn/pmd is huge zero page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: remove alloc_arch_preferred_bootmem() · 3f7dfe24
      Joonsoo Kim authored
      The name of this function is not suitable, and removing the function and
      open-coding it into each call site makes the code more understandable.
      
      Additionally, we shouldn't do an allocation from bootmem when
      slab_is_available(), so directly return kmalloc()'s return value.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bootmem: remove not implemented function call, bootmem_arch_preferred_node() · 2d7a6956
      Joonsoo Kim authored
      There is no implementation of bootmem_arch_preferred_node() and a call to
      this function will cause a compilation error.  So remove it.
      Signed-off-by: Joonsoo Kim <js1304@gmail.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 Dec, 2012 20 commits
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal · 9977d9b3
      Linus Torvalds authored
      Pull big execve/kernel_thread/fork unification series from Al Viro:
       "All architectures are converted to new model.  Quite a bit of that
        stuff is actually shared with architecture trees; in such cases it's
        literally shared branch pulled by both, not a cherry-pick.
      
        A lot of ugliness and black magic is gone (-3KLoC total in this one):
      
         - kernel_thread()/kernel_execve()/sys_execve() redesign.
      
           We don't do syscalls from kernel anymore for either kernel_thread()
           or kernel_execve():
      
           kernel_thread() is essentially clone(2) with callback run before we
           return to userland, the callbacks either never return or do
           successful do_execve() before returning.
      
           kernel_execve() is a wrapper for do_execve() - it doesn't need to
           do transition to user mode anymore.
      
           As a result kernel_thread() and kernel_execve() are
           arch-independent now - they live in kernel/fork.c and fs/exec.c
           resp.  sys_execve() is also in fs/exec.c and it's completely
           architecture-independent.
      
         - daemonize() is gone, along with its parts in fs/*.c
      
         - struct pt_regs * is no longer passed to do_fork/copy_process/
           copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.
      
         - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
           still need wrappers (ones with callee-saved registers not saved in
           pt_regs on syscall entry), but the main part of those suckers is in
           kernel/fork.c now."
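
The kernel_thread() description above ("clone(2) with callback run before we return to userland") can be illustrated with a loose user-space analogy. This is only a sketch, not kernel code: it uses fork(2) rather than the kernel's internal paths, and `spawn_with_callback()`, `wait_for_exit_code()` and `example_callback()` are hypothetical names made up for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a child and run fn(arg) in it before
 * anything else happens there. Returns the child's pid, or -1 on error.
 */
static pid_t spawn_with_callback(int (*fn)(void *), void *arg)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: the callback runs first; its result becomes the exit code. */
        _exit(fn(arg) & 0xff);
    }
    return pid; /* Parent: child's pid, or -1 if fork() failed. */
}

/* Hypothetical helper: wait for the child and return its exit code, or -1. */
static int wait_for_exit_code(pid_t pid)
{
    int status;

    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example callback: returns the pointed-to integer plus one. */
static int example_callback(void *arg)
{
    return *(int *)arg + 1;
}
```

In the kernel the callback either never returns or ends with a successful do_execve(); in this analogy the child simply exits with the callback's return value.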
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
        do_coredump(): get rid of pt_regs argument
        print_fatal_signal(): get rid of pt_regs argument
        ptrace_signal(): get rid of unused arguments
        get rid of ptrace_signal_deliver() arguments
        new helper: signal_pt_regs()
        unify default ptrace_signal_deliver
        flagday: kill pt_regs argument of do_fork()
        death to idle_regs()
        don't pass regs to copy_process()
        flagday: don't pass regs to copy_thread()
        bfin: switch to generic vfork, get rid of pointless wrappers
        xtensa: switch to generic clone()
        openrisc: switch to use of generic fork and clone
        unicore32: switch to generic clone(2)
        score: switch to generic fork/vfork/clone
        c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
        take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
        mn10300: switch to generic fork/vfork/clone
        h8300: switch to generic fork/vfork/clone
        tile: switch to generic clone()
        ...
      
      Conflicts:
      	arch/microblaze/include/asm/Kbuild
      9977d9b3
    • Merge tag 'boards' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · cf4af012
      Linus Torvalds authored
      Pull ARM SoC board updates from Olof Johansson:
       "This branch contains a set of various board updates for ARM platforms.
      
        A few shmobile platforms that are stale have been removed, some
        defconfig updates for various boards selecting new features such as
        pinctrl subsystem support, and various updates enabling peripherals,
        etc."
      
      Fix up conflicts mostly as per Olof.
      
      * tag 'boards' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (58 commits)
        ARM: S3C64XX: Add dummy supplies for Glenfarclas LDOs
        ARM: S3C64XX: Add registration of WM2200 Bells device on Cragganmore
        ARM: kirkwood: Add Plat'Home OpenBlocks A6 support
        ARM: Dove: update defconfig
        ARM: Kirkwood: update defconfig for new boards
        arm: orion5x: add DT related options in defconfig
        arm: orion5x: convert 'LaCie Ethernet Disk mini v2' to Device Tree
        arm: orion5x: basic Device Tree support
        arm: orion5x: mechanical defconfig update
        ARM: kirkwood: Add support for the MPL CEC4
        arm: kirkwood: add support for ZyXEL NSA310
        ARM: Kirkwood: new board USI Topkick
        ARM: kirkwood: use gpio-fan DT binding on lsxl
        ARM: Kirkwood: add Netspace boards to defconfig
        ARM: kirkwood: DT board setup for Network Space Mini v2
        ARM: kirkwood: DT board setup for Network Space Lite v2
        ARM: kirkwood: DT board setup for Network Space v2 and parents
        leds: leds-ns2: add device tree binding
        ARM: Kirkwood: Enable the second I2C bus
        ARM: mmp: select pinctrl driver
        ...
      cf4af012
    • Merge tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · d027db13
      Linus Torvalds authored
      Pull ARM SoC updates from Olof Johansson:
       "This contains the bulk of new SoC development for this merge window.
      
        Two new platforms have been added, the sunxi platforms (Allwinner A1x
        SoCs) by Maxime Ripard, and a generic Broadcom platform for a new
        series of ARMv7 platforms from them, where the hope is that we can
        keep the platform code generic enough to have them all share one mach
        directory.  The new Broadcom platform is contributed by Christian
        Daudt.
      
        Highbank has grown support for Calxeda's next generation of hardware,
        ECX-2000.
      
        clps711x has seen a lot of cleanup from Alexander Shiyan, and he's
        also taken on maintainership of the platform.
      
        Beyond this there has been a bunch of work from a number of people on
        converting more platforms to IRQ domains, pinctrl conversion, cleanup
        and general feature enablement across most of the active platforms."
      
      Fix up trivial conflicts as per Olof.
      
      * tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (174 commits)
        mfd: vexpress-sysreg: Remove LEDs code
        irqchip: irq-sunxi: Add terminating entry for sunxi_irq_dt_ids
        clocksource: sunxi_timer: Add terminating entry for sunxi_timer_dt_ids
        irq: versatile: delete dangling variable
        ARM: sunxi: add missing include for mdelay()
        ARM: EXYNOS: Avoid early use of of_machine_is_compatible()
        ARM: dts: add node for PL330 MDMA1 controller for exynos4
        ARM: EXYNOS: Add support for secondary CPU bring-up on Exynos4412
        ARM: EXYNOS: add UART3 to DEBUG_LL ports
        ARM: S3C24XX: Add clkdev entry for camif-upll clock
        ARM: SAMSUNG: Add s3c24xx/s3c64xx CAMIF GPIO setup helpers
        ARM: sunxi: Add missing sun4i.dtsi file
        pinctrl: samsung: Do not initialise statics to 0
        ARM i.MX6: remove gate_mask from pllv3
        ARM i.MX6: Fix ethernet PLL clocks
        ARM i.MX6: rename PLLs according to datasheet
        ARM i.MX6: Add pwm support
        ARM i.MX51: Add pwm support
        ARM i.MX53: Add pwm support
        ARM: mx5: Replace clk_register_clkdev with clock DT lookup
        ...
      d027db13
    • Merge tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · d01e4afd
      Linus Torvalds authored
      Pull ARM SoC cleanups on various subarchitectures from Olof Johansson:
       "Cleanup patches for various ARM platforms and some of their associated
        drivers.  There's also a branch in here that enables Freescale i.MX to
        be part of the multiplatform support -- the first "big" SoC that is
        moved over (more multiplatform work comes in a separate branch later
        during the merge window)."
      
      Conflicts fixed as per Olof, including a silent semantic one in
      arch/arm/mach-omap2/board-generic.c (omap_prcm_restart() was renamed to
      omap3xxx_restart(), and a new user of the old name was added).
      
      * tag 'cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (189 commits)
        ARM: omap: fix typo on timer cleanup
        ARM: EXYNOS: Remove unused regs-mem.h file
        ARM: EXYNOS: Remove unused non-dt support for dwmci controller
        ARM: Kirkwood: Use hw_pci.ops instead of hw_pci.scan
        ARM: OMAP3: cm-t3517: use GPTIMER for system clock
        ARM: OMAP2+: timer: remove CONFIG_OMAP_32K_TIMER
        ARM: SAMSUNG: use devm_ functions for ADC driver
        ARM: EXYNOS: no duplicate mask/unmask in eint0_15
        ARM: S3C24XX: SPI clock channel setup is fixed for S3C2443
        ARM: EXYNOS: Remove i2c0 resource information and setting of device names
        ARM: Kirkwood: checkpatch cleanups
        ARM: Kirkwood: Fix sparse warnings.
        ARM: Kirkwood: Remove unused includes
        ARM: kirkwood: cleanup lsxl board includes
        ARM: integrator: use BUG_ON where possible
        ARM: integrator: push down SC dependencies
        ARM: integrator: delete static UART1 mapping
        ARM: integrator: delete SC mapping on the CP
        ARM: integrator: remove static CP syscon mapping
        ARM: integrator: remove static AP syscon mapping
        ...
      d01e4afd
    • Merge tag 'headers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 8287361a
      Linus Torvalds authored
      Pull ARM SoC Header cleanups from Olof Johansson:
       "This is a collection of header file cleanups, mostly for OMAP and
        AT91, that keeps moving the platforms in the direction of
        multiplatform by removing the need for mach-dependent header files
        used in drivers and other places."
      
      Fix up mostly trivial conflicts as per Olof.
      
      * tag 'headers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (106 commits)
        ARM: OMAP2+: Move iommu/iovmm headers to platform_data
        ARM: OMAP2+: Make some definitions local
        ARM: OMAP2+: Move iommu2 to drivers/iommu/omap-iommu2.c
        ARM: OMAP2+: Move plat/iovmm.h to include/linux/omap-iommu.h
        ARM: OMAP2+: Move iopgtable header to drivers/iommu/
        ARM: OMAP: Merge iommu2.h into iommu.h
        atmel: move ATMEL_MAX_UART to platform_data/atmel.h
        ARM: OMAP: Remove omap_init_consistent_dma_size()
        arm: at91: move at91rm9200 rtc header in drivers/rtc
        arm: at91: move reset controller header to arm/arm/mach-at91
        arm: at91: move pit define to the driver
        arm: at91: move at91_shdwc.h to arch/arm/mach-at91
        arm: at91: move board header to arch/arm/mach-at91
        arn: at91: move at91_tc.h to arch/arm/mach-at91
        arm: at91 move at91_aic.h to arch/arm/mach-at91
        arm: at91 move board.h to arch/arm/mach-at91
        arm: at91: move platfarm_data to include/linux/platform_data/atmel.h
        arm: at91: drop machine defconfig
        ARM: OMAP: Remove NEED_MACH_GPIO_H
        ARM: OMAP: Remove unnecessary mach and plat includes
        ...
      8287361a
    • Merge tag 'fixes-non-critical' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 2989950c
      Linus Torvalds authored
      Pull ARM SoC Non-critical bug fixes from Olof Johansson:
       "Simple bug fixes that were not considered important enough for
        inclusion into 3.7, especially those that arrived late during the
        merge window.
      
        There's also a MAINTAINERS update for the Renesas platforms in here,
        marking Simon Horman as a maintainer and changing the git url to his
        tree."
      
      * tag 'fixes-non-critical' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        Update ARM/SHMOBILE section of MAINTAINERS
        ARM: Fix Kconfig symbols typo for LEDS
        ARM: pxa: add dummy SA1100 rtc clock in pxa25x
        ARM: pxa: fix pxa25x gpio wakeup setting
        ARM: OMAP4: PM: fix errata handling when CONFIG_PM=n
        ARM: cns3xxx: drop unnecessary symbol selection
        ARM: vexpress: fix ll debug code when building multiplatform
        ARM: OMAP4: retrigger localtimers after re-enabling gic
        ARM: OMAP4460: Workaround for ROM bug because of CA9 r2pX GIC control register change.
        ARM: OMAP4: PM: add errata support
        ARM: davinci: fix return value check by using IS_ERR in tnetv107x_devices_init()
        ARM: davinci: uncompress.h: bail out if uart not initialized
        ARM: davinci: serial.h: fix uart number in the comment
        ARM: davinci: dm644x evm: move pointer dereference below NULL check
        ARM: vexpress: Make the debug UART detection more specific
      2989950c
    • Merge branch 'for-linus' of git://git.linaro.org/people/rmk/linux-arm · b1286f4e
      Linus Torvalds authored
      Pull ARM updates from Russell King:
       "Here's the updates for ARM for this merge window, which cover quite a
        variety of areas.
      
        There's a bunch of patch series from Will tackling various bugs like
        the PROT_NONE handling, ASID allocation, cluster boot protocol and
        ASID TLB tagging updates.
      
        We move to a build-time sorted exception table rather than doing the
        sorting at run-time, add support for the secure computing filter, and
        some updates to the perf code.  We also have sorted out the placement
        of some headers, fixed some build warnings, fixed some hotplug
        problems with the per-cpu TWD code."
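
The benefit of a table that is already sorted at build time is that run-time lookups can use binary search without a boot-time sorting pass. A minimal sketch of such a lookup follows; it is not the kernel's implementation, and `struct ex_entry`, `find_fixup()` and the addresses in `demo_table` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical entry: a faulting instruction address and its fixup address. */
struct ex_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* bsearch() comparator: the key is an address, the element is a table entry. */
static int cmp_insn(const void *key, const void *elem)
{
    unsigned long addr = *(const unsigned long *)key;
    const struct ex_entry *e = elem;

    if (addr < e->insn)
        return -1;
    if (addr > e->insn)
        return 1;
    return 0;
}

/*
 * Look up addr in a table that is assumed to be sorted by insn already
 * (the premise of sorting at build time). Returns NULL if not found.
 */
static const struct ex_entry *find_fixup(const struct ex_entry *tbl,
                                         size_t n, unsigned long addr)
{
    return bsearch(&addr, tbl, n, sizeof(*tbl), cmp_insn);
}

/* Made-up table contents, kept in ascending insn order. */
static const struct ex_entry demo_table[] = {
    { 0x1000, 0x9000 },
    { 0x1040, 0x9010 },
    { 0x2000, 0x9020 },
};
```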
      
      * 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (73 commits)
        ARM: 7594/1: Add .smp entry for REALVIEW_EB
        ARM: 7599/1: head: Remove boot-time HYP mode check for v5 and below
        ARM: 7598/1: net: bpf_jit_32: fix sp-relative load/stores offsets.
        ARM: 7595/1: syscall: rework ordering in syscall_trace_exit
        ARM: 7596/1: mmci: replace readsl/writesl with ioread32_rep/iowrite32_rep
        ARM: 7597/1: net: bpf_jit_32: fix kzalloc gfp/size mismatch.
        ARM: 7593/1: nommu: do not enable DCACHE_WORD_ACCESS when !CONFIG_MMU
        ARM: 7592/1: nommu: prevent generation of kernel unaligned memory accesses
        ARM: 7591/1: nommu: Enable the strict alignment (CR_A) bit only if ARCH < v6
        ARM: 7590/1: /proc/interrupts: limit the display of IPIs to online CPUs only
        ARM: 7587/1: implement optimized percpu variable access
        ARM: 7589/1: integrator: pass the lm resource to amba
        ARM: 7588/1: amba: create a resource parent registrator
        ARM: 7582/2: rename kvm_seq to vmalloc_seq so to avoid confusion with KVM
        ARM: 7585/1: kernel: fix nr_cpu_ids check in DT logical map init
        ARM: 7584/1: perf: fix link error when CONFIG_HW_PERF_EVENTS is not selected
        ARM: gic: use a private mapping for CPU target interfaces
        ARM: kernel: add logical mappings look-up
        ARM: kernel: add cpu logical map DT init in setup_arch
        ARM: kernel: add device tree init map function
        ...
      b1286f4e
    • Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 · 6facac1a
      Linus Torvalds authored
      Pull CIFS fixes from Steve French:
       "This includes a set of misc.  cifs fixes (most importantly some byte
        range lock related write fixes from Pavel, and some ACL and idmap
        related fixes from Jeff) but also includes the SMB2.02 dialect
        enablement, and a key fix for SMB3 mounts.
      
        Default authentication upgraded to ntlmv2 for cifs (it was already
        ntlmv2 for smb2)"
      
      * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (43 commits)
        CIFS: Fix write after setting a read lock for read oplock files
        cifs: parse the device name into UNC and prepath
        cifs: fix up handling of prefixpath= option
        cifs: clean up handling of unc= option
        cifs: fix SID binary to string conversion
        fix "disabling echoes and oplocks" on SMB2 mounts
        Do not send SMB2 signatures for SMB3 frames
        cifs: deal with id_to_sid embedded sid reply corner case
        cifs: fix hardcoded default security descriptor length
        cifs: extra sanity checking for cifs.idmap keys
        cifs: avoid extra allocation for small cifs.idmap keys
        cifs: simplify id_to_sid and sid_to_id mapping code
        CIFS: Fix possible data coherency problem after oplock break to None
        CIFS: Do not permit write to a range mandatory locked with a read lock
        cifs: rename cifs_readdir_lookup to cifs_prime_dcache and make it void return
        cifs: Add CONFIG_CIFS_DEBUG and rename use of CIFS_DEBUG
        cifs: Make CIFS_DEBUG possible to undefine
        SMB3 mounts fail with access denied to some servers
        cifs: Remove unused cEVENT macro
        cifs: always zero out smb_vol before parsing options
        ...
      6facac1a
    • Merge tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs · 3f1c64f4
      Linus Torvalds authored
      Pull xfs update from Ben Myers:
       "There is plenty going on, including the cleanup of xfssyncd, metadata
        verifiers, CRC infrastructure for the log, tracking of inodes with
        speculative allocation, a cleanup of xfs_fs_subr.c, fixes for
        XFS_IOC_ZERO_RANGE, an important fix related to log replay (only
        update the last_sync_lsn when a transaction completes), a fix for
        deadlock on AGF buffers, documentation and comment updates, and a few
        more cleanups and fixes.
      
        Details:
         - remove the xfssyncd mess
         - only update the last_sync_lsn when a transaction completes
         - zero allocation_args on the kernel stack
         - fix AGF/alloc workqueue deadlock
         - silence uninitialised f.file warning
         - Update inode alloc comments
         - Update mount options documentation
         - report projid32bit feature in geometry call
         - speculative preallocation inode tracking
         - fix attr tree double split corruption
         - fix broken error handling in xfs_vm_writepage
         - drop buffer io reference when a bad bio is built
         - add more attribute tree trace points
         - growfs infrastructure changes for 3.8
         - fs/xfs/xfs_fs_subr.c die die die
         - add CRC infrastructure
         - add CRC checks to the log
         - Remove description of nodelaylog mount option from xfs.txt
         - inode allocation should use unmapped buffers
         - byte range granularity for XFS_IOC_ZERO_RANGE
         - fix direct IO nested transaction deadlock
         - fix stray dquot unlock when reclaiming dquots
         - fix sparse reported log CRC endian issue"
      
      Fix up trivial conflict in fs/xfs/xfs_fsops.c due to the same patch
      having been applied twice (commits eaef8543 and 1375cb65: "xfs:
      growfs: don't read garbage for new secondary superblocks") with later
      updates to the affected code in the XFS tree.
      
      * tag 'for-linus-v3.8-rc1' of git://oss.sgi.com/xfs/xfs: (78 commits)
        xfs: fix sparse reported log CRC endian issue
        xfs: fix stray dquot unlock when reclaiming dquots
        xfs: fix direct IO nested transaction deadlock.
        xfs: byte range granularity for XFS_IOC_ZERO_RANGE
        xfs: inode allocation should use unmapped buffers.
        xfs: Remove the description of nodelaylog mount option from xfs.txt
        xfs: add CRC checks to the log
        xfs: add CRC infrastructure
        xfs: convert buffer verifiers to an ops structure.
        xfs: connect up write verifiers to new buffers
        xfs: add pre-write metadata buffer verifier callbacks
        xfs: add buffer pre-write callback
        xfs: Add verifiers to dir2 data readahead.
        xfs: add xfs_da_node verification
        xfs: factor and verify attr leaf reads
        xfs: factor dir2 leaf read
        xfs: factor out dir2 data block reading
        xfs: factor dir2 free block reading
        xfs: verify dir2 block format buffers
        xfs: factor dir2 block read operations
        ...
      3f1c64f4
    • Merge tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm · 22a40fd9
      Linus Torvalds authored
      Pull dlm updates from David Teigland:
       "This set fixes some conditions in which value blocks are invalidated,
        and includes two trivial cleanups."
      
      * tag 'dlm-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
        dlm: fix lvb invalidation conditions
        fs/dlm: remove CONFIG_EXPERIMENTAL
        dlm: remove unused variable in *dlm_lowcomms_get_buffer()
      22a40fd9
    • Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · d206e090
      Linus Torvalds authored
      Pull cgroup changes from Tejun Heo:
       "A lot of activities on cgroup side.  The big changes are focused on
        making cgroup hierarchy handling saner.
      
         - cgroup_rmdir() had peculiar semantics - it allowed cgroup
           destruction to be vetoed by individual controllers and tried to
           drain refcnt synchronously.  The vetoing never worked properly and
           caused a good deal of contortions in cgroup.  memcg was the last
           remaining user.  Michal Hocko removed the usage and cgroup_rmdir()
           path has been simplified significantly.  This was done in a
           separate branch so that the memcg people can base further memcg
           changes on top.
      
         - The above allowed cleaning up cgroup lifecycle management and
           implementation of generic cgroup iterators which are used to
           improve hierarchy support.
      
         - cgroup_freezer updated to allow migration in and out of a frozen
           cgroup and handle hierarchy.  If a cgroup is frozen, all descendant
           cgroups are frozen.
      
         - netcls_cgroup and netprio_cgroup updated to handle hierarchy
           properly.
      
         - Various fixes and cleanups.
      
         - Two merge commits.  One to pull in memcg and rmdir cleanups (needed
           to build iterators).  The other pulled in cgroup/for-3.7-fixes for
           device_cgroup fixes so that further device_cgroup patches can be
           stacked on top."
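
The hierarchy rule stated for cgroup_freezer ("If a cgroup is frozen, all descendant cgroups are frozen") can be sketched as an effective-state check that walks up the tree. This is an illustrative model only, not the kernel's freezer code; `struct group` and `group_is_frozen()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node in a group hierarchy. */
struct group {
    bool self_frozen;      /* freezing was requested on this group itself */
    struct group *parent;  /* NULL for the root of the hierarchy */
};

/*
 * A group is effectively frozen if it, or any of its ancestors,
 * has been frozen.
 */
static bool group_is_frozen(const struct group *g)
{
    for (; g != NULL; g = g->parent)
        if (g->self_frozen)
            return true;
    return false;
}
```

Walking up on every query keeps the sketch short. A real implementation would more likely propagate state downwards when a freeze is requested, so that queries stay cheap.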
      
      Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
      commit bea8c150 ("memcg: fix hotplugged memory zone oops") in master
      touching code close to commit 2ef37d3f ("memcg: Simplify
      mem_cgroup_force_empty_list error handling") in for-3.8)
      
      * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
        cgroup: update Documentation/cgroups/00-INDEX
        cgroup_rm_file: don't delete the uncreated files
        cgroup: remove subsystem files when remounting cgroup
        cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
        cgroup: warn about broken hierarchies only after css_online
        cgroup: list_del_init() on removed events
        cgroup: fix lockdep warning for event_control
        cgroup: move list add after list head initilization
        netprio_cgroup: allow nesting and inherit config on cgroup creation
        netprio_cgroup: implement netprio[_set]_prio() helpers
        netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
        netprio_cgroup: reimplement priomap expansion
        netprio_cgroup: shorten variable names in extend_netdev_table()
        netprio_cgroup: simplify write_priomap()
        netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
        cgroup: remove obsolete guarantee from cgroup_task_migrate.
        cgroup: add cgroup->id
        cgroup, cpuset: remove cgroup_subsys->post_clone()
        cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
        cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
        ...
      d206e090
    • Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu · fef3ff2e
      Linus Torvalds authored
      Pull percpu changes from Tejun Heo:
       "Nothing exciting here either.  Joonsoo's is almost cosmetic.  Cyrill's
        patch fixes "percpu_alloc" early kernel param handling so that the
        kernel doesn't crash when the parameter is specified w/o any argument."
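
The kind of guard such a fix adds can be sketched in user space as follows. This is an assumption-laden illustration rather than the actual patch: `parse_alloc_param()`, `enum alloc_mode` and `chosen_mode` are hypothetical, and the handler is assumed to receive NULL when the parameter is given without "=value".

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allocator modes selectable from the command line. */
enum alloc_mode { MODE_AUTO, MODE_EMBED, MODE_PAGE };

static enum alloc_mode chosen_mode = MODE_AUTO;

/*
 * Hypothetical early-parameter handler. value is assumed to be NULL when
 * the parameter was specified without an argument, so it must be checked
 * before any string comparison dereferences it.
 */
static int parse_alloc_param(const char *value)
{
    if (value == NULL)
        return -EINVAL;

    if (strcmp(value, "embed") == 0)
        chosen_mode = MODE_EMBED;
    else if (strcmp(value, "page") == 0)
        chosen_mode = MODE_PAGE;
    else
        return -EINVAL; /* unknown value: rejected in this sketch */

    return 0;
}
```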
      
      * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
        mm, percpu: Make sure percpu_alloc early parameter has an argument
        percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()
      fef3ff2e
    • Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · e7b55b8f
      Linus Torvalds authored
      Pull workqueue changes from Tejun Heo:
       "Nothing exciting.  Just two trivial changes."
      
      * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up()
        workqueue: trivial fix for return statement in work_busy()
      e7b55b8f
    • Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 50851c62
      Linus Torvalds authored
      Pull thermal management update from Zhang Rui:
       "Highlights:
      
         - Introduction of thermal policy support, together with three new
           thermal governors, including step_wise, user_space, fair_share.
      
         - Introduction of ST-Ericsson db8500_thermal driver and ST-Ericsson
           db8500_cpufreq_cooling driver.
      
         - Thermal Kconfig file and Makefile refactor.
      
         - Fixes for generic thermal layer, generic cpucooling, rcar thermal
           driver and Exynos thermal driver."
      
      * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits)
        Thermal: Fix DEFAULT_THERMAL_GOVERNOR
        Thermal: fix a NULL pointer dereference when generic thermal layer is built as a module
        thermal: rcar: add rcar_zone_to_priv() macro
        thermal: rcar: fixup the unit of temperature
        thermal: cpu cooling: allow module builds
        thermal: cpu cooling: use const parameter while registering
        Thermal: Add ST-Ericsson DB8500 thermal properties and platform data.
        Thermal: Add ST-Ericsson DB8500 thermal driver.
        drivers/thermal/Makefile refactor
        Exynos: Add missing dependency
        Refactor drivers/thermal/Kconfig
        thermal: cpu_cooling: Make 'notify_device' static
        Thermal: Remove the cooling_cpufreq_list.
        Thermal: fix bug of counting cpu frequencies.
        Thermal: add indent for code alignment.
        thermal: rcar_thermal: remove explicitly used devm_kfree/iounap()
        thermal: user_space: Add missing static storage class specifiers
        thermal: fair_share: Add missing static storage class specifiers
        thermal: step_wise: Add missing static storage class specifiers
        Thermal: Fix oops and unlocking in thermal_sys.c
        ...
      50851c62
    • Merge tag 'regmap-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap · 99b8f42e
      Linus Torvalds authored
      Pull regmap updates from Mark Brown:
       "Quite a few enhancements this time around, helpers and diagnostics for
        the most part which is good to see:
      
         - Addition of table based lookups for the register access checks from
           Davide Ciminaghi, making life easier for drivers with big blocks of
           similar registers.
         - Allow drivers to get the irqdomain for regmap irq_chips, allowing
           the domain to be used with other APIs.
         - Debug improvements for paged register maps.
         - Performance improvements for some of the diagnostic infrastructure,
           very helpful for devices with large register maps."
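
The idea behind table-based register access checks can be sketched as a pair of range lists consulted in order. The code below is a stand-alone illustration and does not use the regmap API; `struct reg_range`, `struct access_table`, `reg_allowed()` and the register numbers in `demo_table` are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inclusive register range. */
struct reg_range {
    unsigned int min;
    unsigned int max;
};

/* Hypothetical access table: ranges that are allowed and ranges that are not. */
struct access_table {
    const struct reg_range *yes_ranges;
    size_t n_yes;
    const struct reg_range *no_ranges;
    size_t n_no;
};

static bool reg_in_ranges(unsigned int reg, const struct reg_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (reg >= r[i].min && reg <= r[i].max)
            return true;
    return false;
}

/*
 * Deny if the register is in any "no" range; otherwise allow if there are
 * no "yes" ranges at all, or if the register falls inside one of them.
 */
static bool reg_allowed(unsigned int reg, const struct access_table *t)
{
    if (reg_in_ranges(reg, t->no_ranges, t->n_no))
        return false;
    if (t->n_yes == 0)
        return true;
    return reg_in_ranges(reg, t->yes_ranges, t->n_yes);
}

/* Made-up example: two readable blocks with a hole carved out of the first. */
static const struct reg_range demo_yes[] = { { 0x00, 0x3f }, { 0x80, 0xbf } };
static const struct reg_range demo_no[]  = { { 0x10, 0x1f } };
static const struct access_table demo_table = { demo_yes, 2, demo_no, 1 };
```

A driver with large blocks of similar registers can then describe them with a few ranges instead of writing a callback with one case per register.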
      
      * tag 'regmap-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
        regmap: debugfs: Cache offsets of valid regions for dump
        regmap: debugfs: Factor out initial seek
        regmap: debugfs: Avoid overflows for very small reads
        regmap: Cache register and value sizes for debugfs
        regmap: introduce tables for readable/writeable/volatile/precious checks
        regmap: core: Report registers in hex when we can't cache
        regmap: Fix printing of size_t variable
        regmap: make lock/unlock functions customizable
        regmap: silence GCC warning
        regmap: Split raw writes that cross window boundaries
        regmap: Make return code checks consistent
        regmap: Factor range lookup out of page selection
        regmap: Provide debugfs read of register ranges
        regmap: Factor out debugfs register read
        regmap: Allow ranges to be named
        regmap: When we sanity check during range adds say what errors we find
        regmap: Rename n_ranges to num_ranges
        regmap: irq: Allow users to retrieve the irq_domain
      99b8f42e
    • Merge tag 'please-pull-einj-fix-for-acpi5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras · 139353ff
      Linus Torvalds authored
      Pull ACPI5 error injection fix from Tony Luck:
       "Trivial fix for error injection code using ACPI5 version of EINJ"
      
      * tag 'please-pull-einj-fix-for-acpi5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
        ACPI, APEI, EINJ: Add missed ACPI5 support for error trigger table
      139353ff
    • Merge tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 251a8cfe
      Linus Torvalds authored
      Pull pstore fixes from Tony Luck:
       "Patch series to allow EFI variable backend to pstore to hold multiple
        records."
      
      * tag 'please-pull-pstore_mevent' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        efi_pstore: Add a format check for an existing variable name at erasing time
        efi_pstore: Add a format check for an existing variable name at reading time
        efi_pstore: Add a sequence counter to a variable name
        efi_pstore: Add ctime to argument of erase callback
        efi_pstore: Remove a logic erasing entries from a write callback to hold multiple logs
        efi_pstore: Add a logic erasing entries to an erase callback
        efi_pstore: Check remaining space with QueryVariableInfo() before writing data
      251a8cfe
    • Merge tag 'please-pull-misc-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · 70f2836d
      Linus Torvalds authored
      Pull ia64 fix from Tony Luck:
       "Miscellaneous ia64 fix for 3.8.  Just need to avoid a pending
        namespace collision from other work being merged."
      
      * tag 'please-pull-misc-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        [IA64] Resolve name space collision for cache_show()
      70f2836d
    • Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 · 97ebe8f5
      Linus Torvalds authored
      Pull ARM64 updates from Catalin Marinas:
      
       - Generic execve, kernel_thread, fork/vfork/clone.
      
       - Preparatory patches for KVM support (initialising EL2 mode for later
         installing KVM support, hypervisor stub).
      
       - Signal handling corner case fix (alternative signal stack set up for
         a SEGV handler, which is raised in response to RLIMIT_STACK being
         reached).
      
       - Sub-nanosecond timer error fix.
      
      * tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (30 commits)
        arm64: Update the MAINTAINERS entry
        arm64: compat for clock_adjtime(2) is miswired
        arm64: move FP-SIMD save/restore code to a macro
        arm64: hyp: initialize vttbr_el2 to zero
        arm64: add hypervisor stub
        arm64: record boot mode when entering the kernel
        arm64: move vector entry macro to assembler.h
        arm64: add AArch32 execution modes to ptrace.h
        arm64: expand register mapping between AArch32 and AArch64
        arm64: generic timer: use virtual counter instead of physical at EL0
        arm64: vdso: defer shifting of nanosecond component of timespec
        arm64: vdso: rework __do_get_tspec register allocation and return shift
        arm64: vdso: check sequence counter even for coarse realtime operations
        arm64: vdso: fix clocksource mask when extracting bottom 56 bits
        ARM64: Remove incorrect Kconfig symbol HAVE_SPARSE_IRQ
        Documentation: Fixes a word in Documentation/arm64/memory.txt
        arm64: Make !dirty ptes read-only
        arm64: Convert empty flush_cache_{mm,page} functions to static inline
        arm64: signal: let the compiler inline compat_get_sigframe
        arm64: signal: return struct rt_sigframe from get_sigframe
        ...
      
      Conflicts:
      	arch/arm64/include/asm/unistd32.h
      97ebe8f5
    • Linus Torvalds's avatar
      Merge branch 'omap-serial' of git://git.linaro.org/people/rmk/linux-arm · d07e43d7
      Linus Torvalds authored
      Pull ARM OMAP serial updates from Russell King:
       "This series is a major reworking of the OMAP serial driver code fixing
        various bugs in the hardware-assisted flow control, extending up into
        serial_core for a couple of issues.  These fixes have been done as a
        set of progressive changes and transformations in the hope that no new
        bugs will be introduced by this series.
      
        The problems are manifold, from the driver not being informed about
        updated settings, to the driver not knowing what the intentions of the
        upper layers are.
      
        The first four patches tackle the serial_core layer, allowing it to
        provide the necessary information to drivers, and the remaining
        patches allow the OMAP serial driver to take advantage of this.
      
        This brings hardware-assisted RTS/CTS and XON/XOFF flow control into a
        useful state.
      
        These patches have been in linux-next for most of the last cycle;
        indeed they predate the previous merge window.  They've also been
        posted to the OMAP people."
      
      * 'omap-serial' of git://git.linaro.org/people/rmk/linux-arm: (21 commits)
        SERIAL: omap: fix hardware assisted flow control
        SERIAL: omap: simplify (2)
        SERIAL: omap: move xon/xoff setting earlier
        SERIAL: omap: always set TCR
        SERIAL: omap: simplify
        SERIAL: omap: don't read back LCR/MCR/EFR
        SERIAL: omap: serial_omap_configure_xonxoff() contents into set_termios
        SERIAL: omap: configure xon/xoff before setting modem control lines
        SERIAL: omap: remove OMAP_UART_SYSC_RESET and OMAP_UART_FIFO_CLR
        SERIAL: omap: move driver private definitions and structures to driver
        SERIAL: omap: remove 'irq_pending' bitfield
        SERIAL: omap: fix MCR TCRTLR bit handling
        SERIAL: omap: fix set_mctrl() breakage
        SERIAL: omap: no need to re-read EFR
        SERIAL: omap: remove setting of EFR SCD bit
        SERIAL: omap: allow hardware assisted IXANY mode to be disabled
        SERIAL: omap: allow hardware assisted rts/cts modes to be disabled
        SERIAL: core: add throttle/unthrottle callbacks for hardware assisted flow control
        SERIAL: core: add hardware assisted h/w flow control support
        SERIAL: core: add hardware assisted s/w flow control support
        ...
      
      Conflicts:
      	drivers/tty/serial/omap-serial.c
      d07e43d7