1. 26 Sep, 2014 22 commits
    • KOSAKI Motohiro's avatar
      mm: __rmqueue_fallback() should respect pageblock type · e961e1e6
      KOSAKI Motohiro authored
      commit 0cbef29a upstream.
      
      When __rmqueue_fallback() doesn't find a free block with the required size
      it splits a larger page and puts the rest of the page onto the free list.
      
      But it has one serious mistake.  When putting back, __rmqueue_fallback()
      always use start_migratetype if type is not CMA.  However,
      __rmqueue_fallback() is only called when all of the start_migratetype
      queue is empty.  That said, __rmqueue_fallback always puts back memory to
      the wrong queue except try_to_steal_freepages() changed pageblock type
      (i.e.  requested size is smaller than half of page block).  The end result
      is that the antifragmentation framework increases fragmenation instead of
      decreasing it.
      
      Mel's original anti fragmentation does the right thing.  But commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.
      
      This patch restores sane and old behavior.  It also removes an incorrect
      comment which was introduced by commit fef903ef ("mm/page_alloc.c:
      restructure free-page stealing code and fix a bug").
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e961e1e6
    • KOSAKI Motohiro's avatar
      mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag() · 0a0809a8
      KOSAKI Motohiro authored
      commit 52c8f6a5 upstream.
      
      In general, every tracepoint should be zero overhead if it is disabled.
      However, trace_mm_page_alloc_extfrag() is one of exception.  It evaluate
      "new_type == start_migratetype" even if tracepoint is disabled.
      
      However, the code can be moved into tracepoint's TP_fast_assign() and
      TP_fast_assign exist exactly such purpose.  This patch does it.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0a0809a8
    • Damien Ramonda's avatar
      readahead: fix sequential read cache miss detection · cd5900b2
      Damien Ramonda authored
      commit af248a0c upstream.
      
      The kernel's readahead algorithm sometimes interprets random read
      accesses as sequential and triggers unnecessary data prefecthing from
      storage device (impacting random read average latency).
      
      In order to identify sequential cache read misses, the readahead
      algorithm intends to check whether offset - previous offset == 1
      (trivial sequential reads) or offset - previous offset == 0 (sequential
      reads not aligned on page boundary):
      
        if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
      
      The current offset is stored in the "offset" variable of type "pgoff_t"
      (unsigned long), while previous offset is stored in "ra->prev_pos" of
      type "loff_t" (long long).  Therefore, operands of the if statement are
      implicitly converted to type long long.  Consequently, when previous
      offset > current offset (which happens on random pattern), the if
      condition is true and access is wrongly interpeted as sequential.  An
      unnecessary data prefetching is triggered, impacting the average random
      read latency.
      
      Storing the previous offset value in a "pgoff_t" variable (unsigned
      long) fixes the sequential read detection logic.
      Signed-off-by: default avatarDamien Ramonda <damien.ramonda@intel.com>
      Reviewed-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarPierre Tardy <pierre.tardy@intel.com>
      Acked-by: default avatarDavid Cohen <david.a.cohen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      cd5900b2
    • Dan Streetman's avatar
      swap: change swap_list_head to plist, add swap_avail_head · 56f7f361
      Dan Streetman authored
      commit 18ab4d4c upstream.
      
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries and simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      56f7f361
    • Dan Streetman's avatar
      lib/plist: add plist_requeue · af1f48ee
      Dan Streetman authored
      commit a75f232c upstream.
      
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or swap file) are added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      af1f48ee
    • Dan Streetman's avatar
      lib/plist: add helper functions · 80e85acd
      Dan Streetman authored
      commit fd16618e upstream.
      
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      80e85acd
    • Dan Streetman's avatar
      swap: change swap_info singly-linked list to list_head · 75b1f2d3
      Dan Streetman authored
      commit adfab836 upstream.
      
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      75b1f2d3
    • Michal Hocko's avatar
      mm: exclude memoryless nodes from zone_reclaim · f14c889d
      Michal Hocko authored
      commit 70ef57e6 upstream.
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f14c889d
    • Nishanth Aravamudan's avatar
      hugetlb: ensure hugepage access is denied if hugepages are not supported · a32d8674
      Nishanth Aravamudan authored
      commit 457c1b27 upstream.
      
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is properly not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: default avatarNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a32d8674
    • Hugh Dickins's avatar
      mm: fix bad rss-counter if remap_file_pages raced migration · 75858faa
      Hugh Dickins authored
      commit 88784396 upstream.
      
      Fix some "Bad rss-counter state" reports on exit, arising from the
      interaction between page migration and remap_file_pages(): zap_pte()
      must count a migration entry when zapping it.
      
      And yes, it is possible (though very unusual) to find an anon page or
      swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
      get_user_pages(write, force) case which COWs even in a shared mapping.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Tested-by: Sasha Levin sasha.levin@oracle.com>
      Tested-by: Dave Jones davej@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      75858faa
    • Han Pingtian's avatar
      mm: prevent setting of a value less than 0 to min_free_kbytes · 73210d0a
      Han Pingtian authored
      commit da8c757b upstream.
      
      If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang.  Changing
      proc_dointvec() to proc_dointvec_minmax() in the
      min_free_kbytes_sysctl_handler() can prevent this to happen.
      
      mhocko said:
      
      : You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
      : your machine unusable but I agree that proc_dointvec_minmax is more
      : suitable here as we already have:
      :
      : 	.proc_handler   = min_free_kbytes_sysctl_handler,
      : 	.extra1         = &zero,
      :
      : It used to work properly but then 6fce56ec ("sysctl: Remove references
      : to ctl_name and strategy from the generic sysctl table") has removed
      : sysctl_intvec strategy and so extra1 is ignored.
      Signed-off-by: default avatarHan Pingtian <hanpt@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      73210d0a
    • Joonsoo Kim's avatar
      slab: correct pfmemalloc check · 3ddc614c
      Joonsoo Kim authored
      commit 73293c2f upstream.
      
      We checked pfmemalloc by slab unit, not page unit. You can see this
      in is_slab_pfmemalloc(). So other pages don't need to be set/cleared
      pfmemalloc.
      
      And, therefore we should check pfmemalloc in page flag of first page,
      but current implementation don't do that. virt_to_head_page(obj) just
      return 'struct page' of that object, not one of first page, since the SLAB
      don't use __GFP_COMP when CONFIG_MMU. To get 'struct page' of first page,
      we first get a slab and try to get it via virt_to_head_page(slab->s_mem).
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarPekka Enberg <penberg@iki.fi>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3ddc614c
    • Bob Liu's avatar
      mm: thp: khugepaged: add policy for finding target node · fc2dd02e
      Bob Liu authored
      commit 9f1b868a upstream.
      
      Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a
      hugepage which is allocated from the node of the first scanned normal
      page, but this policy is too rough and may end with unexpected result to
      upper users.
      
      The problem is the original page-balancing among all nodes will be
      broken after hugepaged started.  Thinking about the case if the first
      scanned normal page is allocated from node A, most of other scanned
      normal pages are allocated from node B or C..  But hugepaged will always
      allocate hugepage from node A which will cause extra memory pressure on
      node A which is not the situation before khugepaged started.
      
      This patch try to fix this problem by making khugepaged allocate
      hugepage from the node which have max record of scaned normal pages hit,
      so that the effect to original page-balancing can be minimized.
      
      The other problem is if normal scanned pages are equally allocated from
      Node A,B and C, after khugepaged started Node A will still suffer extra
      memory pressure.
      
      Andrew Davidoff reported a related issue several days ago.  He wanted
      his application interleaving among all nodes and "numactl
      --interleave=all ./test" was used to run the testcase, but the result
      wasn't not as expected.
      
        cat /proc/2814/numa_maps:
        7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098
      
      The end result showed that most pages are from Node3 instead of
      interleave among node0-3 which was unreasonable.
      
      This patch also fix this issue by allocating hugepage round robin from
      all nodes have the same record, after this patch the result was as
      expected:
      
        7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
      
      The simple testcase is like this:
      
      int main() {
      	char *p;
      	int i;
      	int j;
      
      	for (i=0; i < 200; i++) {
      		p = (char *)malloc(1048576);
      		printf("malloc done\n");
      
      		if (p == 0) {
      			printf("Out of memory\n");
      			return 1;
      		}
      		for (j=0; j < 1048576; j++) {
      			p[j] = 'A';
      		}
      		printf("touched memory\n");
      
      		sleep(1);
      	}
      	printf("enter sleep\n");
      	while(1) {
      		sleep(100);
      	}
      }
      
      [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
      Reported-by: default avatarAndrew Davidoff <davidoff@qedmf.net>
      Tested-by: default avatarAndrew Davidoff <davidoff@qedmf.net>
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fc2dd02e
    • Bob Liu's avatar
      mm: thp: cleanup: mv alloc_hugepage to better place · c3bd31a1
      Bob Liu authored
      commit 10dc4155 upstream.
      
      Move alloc_hugepage() to a better place, no need for a seperate #ifndef
      CONFIG_NUMA
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Davidoff <davidoff@qedmf.net>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c3bd31a1
    • Jiri Slaby's avatar
      Linux 3.12.29 · b45ddfa2
      Jiri Slaby authored
      b45ddfa2
    • Will Deacon's avatar
      arm64: flush TLS registers during exec · 7bcae251
      Will Deacon authored
      commit eb35bdd7 upstream.
      
      Nathan reports that we leak TLS information from the parent context
      during an exec, as we don't clear the TLS registers when flushing the
      thread state.
      
      This patch updates the flushing code so that we:
      
        (1) Unconditionally zero the tpidr_el0 register (since this is fully
            context switched for native tasks and zeroed for compat tasks)
      
        (2) Zero the tp_value state in thread_info before clearing the
            tpidrr0_el0 register for compat tasks (since this is only writable
            by the set_tls compat syscall and therefore not fully switched).
      
      A missing compiler barrier is also added to the compat set_tls syscall.
      Acked-by: default avatarNathan Lynch <Nathan_Lynch@mentor.com>
      Reported-by: default avatarNathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7bcae251
    • Jeff Moyer's avatar
      aio: add missing smp_rmb() in read_events_ring · 0fbdd4f7
      Jeff Moyer authored
      commit 2ff396be upstream.
      
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0fbdd4f7
    • Anton Blanchard's avatar
      ibmveth: Fix endian issues with rx_no_buffer statistic · e66983c5
      Anton Blanchard authored
      commit cbd52281 upstream.
      
      Hidden away in the last 8 bytes of the buffer_list page is a solitary
      statistic. It needs to be byte swapped or else ethtool -S will
      produce numbers that terrify the user.
      
      Since we do this in multiple places, create a helper function with a
      comment explaining what is going on.
      Signed-off-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e66983c5
    • Murali Karicheri's avatar
      ahci: add pcid for Marvel 0x9182 controller · be11da66
      Murali Karicheri authored
      commit c5edfff9 upstream.
      
      Keystone K2E EVM uses Marvel 0x9182 controller. This requires support
      for the ID in the ahci driver.
      Signed-off-by: default avatarMurali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      be11da66
    • James Ralston's avatar
      ahci: Add Device IDs for Intel 9 Series PCH · 1b935cf4
      James Ralston authored
      commit 1b071a09 upstream.
      
      This patch adds the AHCI mode SATA Device IDs for the Intel 9 Series PCH.
      Signed-off-by: default avatarJames Ralston <james.d.ralston@intel.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1b935cf4
    • Arjun Sreedharan's avatar
      pata_scc: propagate return value of scc_wait_after_reset · 6cd2641b
      Arjun Sreedharan authored
      commit 4dc7c76c upstream.
      
      scc_bus_softreset not necessarily should return zero.
      Propagate the error code.
      Signed-off-by: default avatarArjun Sreedharan <arjun024@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6cd2641b
    • Tejun Heo's avatar
      libata: widen Crucial M550 blacklist matching · 61ee2622
      Tejun Heo authored
      commit 2a13772a upstream.
      
      Crucial M550 may cause data corruption on queued trims and is
      blacklisted.  The pattern used for it fails to match 1TB one as the
      capacity section will be four chars instead of three.  Widen the
      pattern.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarCharles Reiss <woggling@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81071Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      61ee2622
  2. 18 Sep, 2014 18 commits