1. 26 Sep, 2014 19 commits
    • swap: change swap_list_head to plist, add swap_avail_head · 56f7f361
      Dan Streetman authored
      commit 18ab4d4c upstream.
      
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries are simply added to their corresponding plist and automatically
      ordered correctly.
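
      The ordering that a priority list maintains on insertion can be sketched
      in userspace as follows.  This is only an illustration, not the kernel
      plist code: struct swap_dev and prio_list_add() are hypothetical names,
      and a higher prio value is assumed to mean "used first".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap device entry (not swap_info_struct). */
struct swap_dev {
	const char *name;
	int prio;                /* assumption: higher value is used first */
	struct swap_dev *next;
};

/*
 * Insert dev so the list stays sorted by descending priority.
 * A new entry is placed after existing entries of equal priority.
 */
static void prio_list_add(struct swap_dev **head, struct swap_dev *dev)
{
	struct swap_dev **pos = head;

	assert(head != NULL && dev != NULL);
	while (*pos != NULL && (*pos)->prio >= dev->prio)
		pos = &(*pos)->next;
	dev->next = *pos;
	*pos = dev;
}
```

      Callers only add entries; no separate sorting step is needed, which is
      the property the commit relies on.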
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • lib/plist: add plist_requeue · af1f48ee
      Dan Streetman authored
      commit a75f232c upstream.
      
      Add plist_requeue(), which moves the specified plist_node after all other
      same-priority plist_nodes in the list.  This is essentially an optimized
      plist_del() followed by plist_add().
      
      This is needed by swap, which (with the next patch in this set) uses a
      plist of available swap devices.  When a swap device (either a swap
      partition or swap file) is added to the system with swapon(), the device
      is added to a plist, ordered by the swap device's priority.  When swap
      needs to allocate a page from one of the swap devices, it takes the page
      from the first swap device on the plist, which is the highest priority
      swap device.  The swap device is left in the plist until all its pages are
      used, and then removed from the plist when it becomes full.
      
      However, as described in man 2 swapon, swap must allocate pages from swap
      devices with the same priority in round-robin order; to do this, on each
      swap page allocation, swap uses a page from the first swap device in the
      plist, and then calls plist_requeue() to move that swap device entry to
      after any other same-priority swap devices.  The next swap page allocation
      will again use a page from the first swap device in the plist and requeue
      it, and so on, resulting in round-robin usage of equal-priority swap
      devices.
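
      The round-robin effect of "use the first entry, then requeue it" can be
      sketched in userspace with a sorted array.  This is not the kernel's
      plist_requeue() implementation; struct swap_dev, requeue() and
      alloc_page_round_robin() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical swap device record; the array is sorted by descending prio. */
struct swap_dev {
	int id;
	int prio;
};

/*
 * Move devs[idx] after the last entry that has the same priority,
 * shifting the entries in between one slot forward.
 */
static void requeue(struct swap_dev *devs, size_t n, size_t idx)
{
	assert(devs != NULL && idx < n);

	struct swap_dev moved = devs[idx];
	size_t i = idx;

	while (i + 1 < n && devs[i + 1].prio == moved.prio) {
		devs[i] = devs[i + 1];
		i++;
	}
	devs[i] = moved;
}

/* Take a page from the first device, then requeue it. Returns its id. */
static int alloc_page_round_robin(struct swap_dev *devs, size_t n)
{
	assert(devs != NULL && n > 0);

	int id = devs[0].id;

	requeue(devs, n, 0);
	return id;
}
```

      With two devices at priority 10 and one at priority 5, repeated calls
      alternate between the two priority-10 devices and never touch the
      lower-priority one, because nothing in this sketch marks a device full.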
      
      Also add plist_test_requeue() test function, for use by plist_test() to
      test plist_requeue() function.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • lib/plist: add helper functions · 80e85acd
      Dan Streetman authored
      commit fd16618e upstream.
      
      Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
      define and initialize a struct plist_head.
      
      Add plist_for_each_continue() and plist_for_each_entry_continue(),
      equivalent to list_for_each_continue() and list_for_each_entry_continue(),
      to iterate over a plist continuing after the current position.
      
      Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
      and ->next, implemented by list_prev_entry() and list_next_entry(), to
      access the prev/next struct plist_node entry.  These are needed because
      unlike struct list_head, direct access of the prev/next struct plist_node
      isn't possible; the list must be navigated via the contained struct
      list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
      node_list) it can be accessed by plist_prev(node).
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • swap: change swap_info singly-linked list to list_head · 75b1f2d3
      Dan Streetman authored
      commit adfab836 upstream.
      
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      This complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested, using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: exclude memoryless nodes from zone_reclaim · f14c889d
      Michal Hocko authored
      commit 70ef57e6 upstream.
      
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction, numactl on the said hardware suggests that zone reclaim
      should not have been enabled in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes from
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
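
      The effect of skipping memoryless nodes can be sketched in userspace as
      below.  This is not the kernel function: zone_reclaim_wanted(), struct
      node_info and the threshold value are assumptions made for this sketch,
      and the two table slots stand for the report's nodes 0 and 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 2
#define RECLAIM_DISTANCE 30	/* assumed threshold for this sketch */

/* Hypothetical per-node summary; size_mb == 0 means a memoryless node. */
struct node_info {
	unsigned long size_mb;
};

/*
 * Return true if any pair of nodes that both have memory is farther
 * apart than RECLAIM_DISTANCE.  Memoryless nodes are ignored on both
 * sides, since there is nothing local to reclaim on them.
 */
static bool zone_reclaim_wanted(const struct node_info nodes[NR_NODES],
				const int distance[NR_NODES][NR_NODES])
{
	assert(nodes != NULL && distance != NULL);

	for (size_t i = 0; i < NR_NODES; i++) {
		if (nodes[i].size_mb == 0)
			continue;
		for (size_t j = 0; j < NR_NODES; j++) {
			if (nodes[j].size_mb == 0)
				continue;
			if (distance[i][j] > RECLAIM_DISTANCE)
				return true;
		}
	}
	return false;
}
```

      With the topology from the report (one node of 0 MB, one of 7168 MB,
      distance 40 between them) the sketch does not enable zone reclaim,
      because the only node with memory is at distance 10 from itself.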
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • hugetlb: ensure hugepage access is denied if hugepages are not supported · a32d8674
      Nishanth Aravamudan authored
      commit 457c1b27 upstream.
      
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is probably not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: fix bad rss-counter if remap_file_pages raced migration · 75858faa
      Hugh Dickins authored
      commit 88784396 upstream.
      
      Fix some "Bad rss-counter state" reports on exit, arising from the
      interaction between page migration and remap_file_pages(): zap_pte()
      must count a migration entry when zapping it.
      
      And yes, it is possible (though very unusual) to find an anon page or
      swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
      get_user_pages(write, force) case which COWs even in a shared mapping.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: prevent setting of a value less than 0 to min_free_kbytes · 73210d0a
      Han Pingtian authored
      commit da8c757b upstream.
      
      If echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.  Changing
      proc_dointvec() to proc_dointvec_minmax() in the
      min_free_kbytes_sysctl_handler() prevents this from happening.
      
      mhocko said:
      
      : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
      : your machine unusable but I agree that proc_dointvec_minmax is more
      : suitable here as we already have:
      :
      : 	.proc_handler   = min_free_kbytes_sysctl_handler,
      : 	.extra1         = &zero,
      :
      : It used to work properly but then 6fce56ec ("sysctl: Remove references
      : to ctl_name and strategy from the generic sysctl table") has removed
      : sysctl_intvec strategy and so extra1 is ignored.
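
      The difference a range-checked handler makes can be sketched in
      userspace as below.  This is not proc_dointvec_minmax() itself: struct
      int_range and store_int_minmax() are hypothetical, loosely modeled on
      the extra1/extra2 bounds mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bounds; a NULL pointer means "no limit on that side". */
struct int_range {
	const int *min;
	const int *max;
};

/*
 * Store val into *dst only if it lies inside the range.
 * Out-of-range values are rejected with -EINVAL and *dst is untouched.
 */
static int store_int_minmax(int *dst, int val, const struct int_range *r)
{
	assert(dst != NULL && r != NULL);

	if ((r->min != NULL && val < *r->min) ||
	    (r->max != NULL && val > *r->max))
		return -EINVAL;
	*dst = val;
	return 0;
}
```

      With a lower bound of zero and no upper bound, -1 is rejected while a
      very large value is still accepted, which matches mhocko's remark.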
      Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • slab: correct pfmemalloc check · 3ddc614c
      Joonsoo Kim authored
      commit 73293c2f upstream.
      
      We check pfmemalloc per slab, not per page.  You can see this in
      is_slab_pfmemalloc().  So the other pages don't need to have pfmemalloc
      set or cleared.

      Therefore we should check pfmemalloc in the page flags of the first
      page, but the current implementation doesn't do that.
      virt_to_head_page(obj) just returns the 'struct page' of that object,
      not that of the first page, since SLAB doesn't use __GFP_COMP when
      CONFIG_MMU.  To get the 'struct page' of the first page, we first get the
      slab and then get it via virt_to_head_page(slab->s_mem).
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@iki.fi>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: thp: khugepaged: add policy for finding target node · fc2dd02e
      Bob Liu authored
      commit 9f1b868a upstream.
      
      Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace them with
      a hugepage which is allocated from the node of the first scanned normal
      page, but this policy is too rough and may end with unexpected results
      for upper users.

      The problem is that the original page-balancing among all nodes will be
      broken after khugepaged starts.  Consider the case where the first
      scanned normal page is allocated from node A while most of the other
      scanned normal pages are allocated from node B or C.  Khugepaged will
      always allocate the hugepage from node A, which causes extra memory
      pressure on node A that was not there before khugepaged started.

      This patch tries to fix this problem by making khugepaged allocate the
      hugepage from the node which has the highest count of scanned normal
      pages, so that the effect on the original page-balancing is minimized.

      The other problem is that if the scanned normal pages are equally
      allocated from nodes A, B and C, node A will still suffer extra memory
      pressure after khugepaged starts.
      
      Andrew Davidoff reported a related issue several days ago.  He wanted
      his application interleaved among all nodes and "numactl
      --interleave=all ./test" was used to run the testcase, but the result
      was not as expected.
      
        cat /proc/2814/numa_maps:
        7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098
      
      The end result showed that most pages are from Node3 instead of being
      interleaved among node0-3, which was unreasonable.

      This patch also fixes this issue by allocating hugepages round robin
      from all nodes that have the same record.  After this patch the result
      was as expected:
      
        7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
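
      The selection policy described above (pick the node with the most
      scanned pages, rotate among nodes that tie) can be sketched in
      userspace as follows.  The names find_target_node(), MAX_NODES and
      NO_NODE are assumptions for this sketch; it only illustrates the idea
      and is not presented as the kernel code.

```c
#include <assert.h>

#define MAX_NODES 4
#define NO_NODE (-1)

/*
 * load[nid] is the number of scanned pages found on node nid.
 * *last holds the node chosen by the previous call (NO_NODE at first).
 */
static int find_target_node(const int load[MAX_NODES], int *last)
{
	int nid, target = 0, max_value = 0;

	assert(load != 0 && last != 0);

	/* Find the first node with the highest count. */
	for (nid = 0; nid < MAX_NODES; nid++) {
		if (load[nid] > max_value) {
			max_value = load[nid];
			target = nid;
		}
	}

	/* If that node was used already, move on to the next tied node. */
	if (target <= *last) {
		for (nid = *last + 1; nid < MAX_NODES; nid++) {
			if (load[nid] == max_value) {
				target = nid;
				break;
			}
		}
	}

	*last = target;
	return target;
}
```

      When all four nodes have the same count, successive calls return nodes
      0, 1, 2, 3 and then wrap back to 0; when one node clearly dominates,
      that node is returned every time.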
      
      The simple testcase is like this:
      
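      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void) {
      	char *p;
      	int i;
      	int j;

      	/* Allocate and touch 200 buffers of 1 MiB each, one per second. */
      	for (i = 0; i < 200; i++) {
      		p = malloc(1048576);
      		printf("malloc done\n");

      		if (p == NULL) {
      			printf("Out of memory\n");
      			return 1;
      		}
      		/* Touch every byte so the pages are really allocated. */
      		for (j = 0; j < 1048576; j++) {
      			p[j] = 'A';
      		}
      		printf("touched memory\n");

      		sleep(1);
      	}
      	/* Keep the memory resident so numa_maps can be inspected. */
      	printf("enter sleep\n");
      	while (1) {
      		sleep(100);
      	}
      }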
      
      [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
      Reported-by: Andrew Davidoff <davidoff@qedmf.net>
      Tested-by: Andrew Davidoff <davidoff@qedmf.net>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • mm: thp: cleanup: mv alloc_hugepage to better place · c3bd31a1
      Bob Liu authored
      commit 10dc4155 upstream.
      
      Move alloc_hugepage() to a better place; there is no need for a separate
      #ifndef CONFIG_NUMA.
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Davidoff <davidoff@qedmf.net>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • Linux 3.12.29 · b45ddfa2
      Jiri Slaby authored
    • arm64: flush TLS registers during exec · 7bcae251
      Will Deacon authored
      commit eb35bdd7 upstream.
      
      Nathan reports that we leak TLS information from the parent context
      during an exec, as we don't clear the TLS registers when flushing the
      thread state.
      
      This patch updates the flushing code so that we:
      
        (1) Unconditionally zero the tpidr_el0 register (since this is fully
            context switched for native tasks and zeroed for compat tasks)
      
        (2) Zero the tp_value state in thread_info before clearing the
            tpidrro_el0 register for compat tasks (since this is only writable
            by the set_tls compat syscall and therefore not fully switched).
      
      A missing compiler barrier is also added to the compat set_tls syscall.
      Acked-by: Nathan Lynch <Nathan_Lynch@mentor.com>
      Reported-by: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • aio: add missing smp_rmb() in read_events_ring · 0fbdd4f7
      Jeff Moyer authored
      commit 2ff396be upstream.
      
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
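
      The ordering requirement (read the tail first, and only then read the
      events it covers) can be sketched in userspace with C11 atomics.  This
      is an analogue, not the aio code: struct ring, ring_push() and
      ring_pop() are hypothetical, and an acquire load is used here where
      the kernel patch adds smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8

struct event {
	int data;
};

/* Hypothetical single-producer, single-consumer ring (no full check). */
struct ring {
	_Atomic unsigned tail;	/* next slot the producer will fill */
	unsigned head;		/* next slot the consumer will read */
	struct event events[RING_SIZE];
};

/* Producer: fill the slot, then publish the new tail with release order. */
static void ring_push(struct ring *r, int data)
{
	assert(r != NULL);

	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	r->events[tail % RING_SIZE].data = data;
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
}

/*
 * Consumer: load the tail with acquire order before touching events, so
 * the event contents written before the tail update are visible.
 * Returns 1 if an event was copied to *out, 0 if the ring is empty.
 */
static int ring_pop(struct ring *r, int *out)
{
	assert(r != NULL && out != NULL);

	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (r->head == tail)
		return 0;
	*out = r->events[r->head % RING_SIZE].data;
	r->head++;
	return 1;
}
```

      Without the acquire ordering on the tail load, a weakly ordered CPU
      could read a slot before the producer's write to it became visible,
      which is consistent with the zeroed events described above.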
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ibmveth: Fix endian issues with rx_no_buffer statistic · e66983c5
      Anton Blanchard authored
      commit cbd52281 upstream.
      
      Hidden away in the last 8 bytes of the buffer_list page is a solitary
      statistic. It needs to be byte swapped or else ethtool -S will
      produce numbers that terrify the user.
      
      Since we do this in multiple places, create a helper function with a
      comment explaining what is going on.
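
      A helper of this kind can be sketched in userspace as below.  It
      assumes a 4096-byte page holding a big-endian 64-bit counter in its
      last 8 bytes; read_be64_tail() and PAGE_BYTES are hypothetical names,
      not the driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096	/* assumed page size for this sketch */

/*
 * Decode the last 8 bytes of the page as a big-endian 64-bit value.
 * Building the value byte by byte gives the same result on little-
 * and big-endian hosts.
 */
static uint64_t read_be64_tail(const unsigned char *page)
{
	assert(page != NULL);

	const unsigned char *p = page + PAGE_BYTES - 8;
	uint64_t v = 0;

	for (size_t i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

      Reading the same 8 bytes through a native uint64_t pointer on a
      little-endian host would put the low-order byte first and report a
      hugely inflated count.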
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ahci: add pcid for Marvel 0x9182 controller · be11da66
      Murali Karicheri authored
      commit c5edfff9 upstream.
      
      Keystone K2E EVM uses the Marvell 0x9182 controller. This requires support
      for the ID in the ahci driver.
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • ahci: Add Device IDs for Intel 9 Series PCH · 1b935cf4
      James Ralston authored
      commit 1b071a09 upstream.
      
      This patch adds the AHCI mode SATA Device IDs for the Intel 9 Series PCH.
      Signed-off-by: James Ralston <james.d.ralston@intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • pata_scc: propagate return value of scc_wait_after_reset · 6cd2641b
      Arjun Sreedharan authored
      commit 4dc7c76c upstream.
      
      scc_bus_softreset() should not necessarily return zero.
      Propagate the error code.
      Signed-off-by: Arjun Sreedharan <arjun024@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • libata: widen Crucial M550 blacklist matching · 61ee2622
      Tejun Heo authored
      commit 2a13772a upstream.
      
      Crucial M550 may cause data corruption on queued trims and is
      blacklisted.  The pattern used for it fails to match the 1TB model
      because its capacity section is four characters instead of three.
      Widen the pattern.
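
      Why a three-character wildcard misses a four-character capacity can be
      shown in userspace with POSIX fnmatch().  The pattern and model strings
      used with it below are illustrative assumptions, not quoted from the
      driver, and fnmatch() stands in for the kernel's own matcher;
      model_matches() is a hypothetical wrapper.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the model string matches the shell-style pattern. */
static bool model_matches(const char *pattern, const char *model)
{
	assert(pattern != NULL && model != NULL);
	return fnmatch(pattern, model, 0) == 0;
}
```

      For example, a pattern such as "Crucial_CT???M550SSD*" matches a model
      string with a three-digit capacity like "Crucial_CT512M550SSD1" but not
      "Crucial_CT1024M550SSD1", while "Crucial_CT*M550SSD*" matches both.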
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Charles Reiss <woggling@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81071
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
  2. 18 Sep, 2014 18 commits
  3. 17 Sep, 2014 3 commits
    • libceph: gracefully handle large reply messages from the mon · a1f3bee2
      Sage Weil authored
      commit 73c3d481 upstream.
      
      We preallocate a few of the message types we get back from the mon.  If we
      get a larger message than we are expecting, fall back to trying to allocate
      a new one instead of blindly using the one we have.
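
      The fallback can be sketched in userspace as below.  struct msg and
      get_reply() are hypothetical and much simpler than the libceph message
      types; the sketch only shows the size check followed by a fresh
      allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical message buffer with a fixed capacity. */
struct msg {
	size_t cap;
	unsigned char *buf;
};

/*
 * Return a message able to hold `need` bytes.  The preallocated message
 * is reused when it is large enough; otherwise a new one is allocated.
 * *fresh is set to 1 when the caller owns (and must free) the result.
 * Returns NULL if the allocation fails.
 */
static struct msg *get_reply(struct msg *prealloc, size_t need, int *fresh)
{
	struct msg *m;

	assert(prealloc != NULL && fresh != NULL);

	*fresh = 0;
	if (need <= prealloc->cap)
		return prealloc;

	m = malloc(sizeof(*m));
	if (m == NULL)
		return NULL;
	m->buf = malloc(need);
	if (m->buf == NULL) {
		free(m);
		return NULL;
	}
	m->cap = need;
	*fresh = 1;
	return m;
}
```

      The point of the check is that an oversized reply is never copied into
      the smaller preallocated buffer.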
      Signed-off-by: Sage Weil <sage@redhat.com>
      Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
    • IB/srp: Fix deadlock between host removal and multipathd · 2b84b406
      Bart Van Assche authored
      commit bcc05910 upstream.
      
      If scsi_remove_host() is invoked after a SCSI device has been blocked,
      if the fast_io_fail_tmo or dev_loss_tmo work gets scheduled on the
      workqueue executing srp_remove_work() and if an I/O request is
      scheduled after the SCSI device had been blocked by e.g. multipathd
      then the following deadlock can occur:
      
          kworker/6:1     D ffff880831f3c460     0   195      2 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff8105af6f>] msleep+0x2f/0x40
           [<ffffffff8123b0ae>] __blk_drain_queue+0x4e/0x180
           [<ffffffff8123d2d5>] blk_cleanup_queue+0x225/0x230
           [<ffffffffa0010732>] __scsi_remove_device+0x62/0xe0 [scsi_mod]
           [<ffffffffa000ed2f>] scsi_forget_host+0x6f/0x80 [scsi_mod]
           [<ffffffffa0002eba>] scsi_remove_host+0x7a/0x130 [scsi_mod]
           [<ffffffffa07cf5c5>] srp_remove_work+0x95/0x180 [ib_srp]
           [<ffffffff8106d7aa>] process_one_work+0x1ea/0x6c0
           [<ffffffff8106dd9b>] worker_thread+0x11b/0x3a0
           [<ffffffff810758bd>] kthread+0xed/0x110
           [<ffffffff814b972c>] ret_from_fork+0x7c/0xb0
          multipathd      D ffff880096acc460     0  5340      1 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff814ab79b>] io_schedule_timeout+0x9b/0xf0
           [<ffffffff814abe1c>] wait_for_completion_io_timeout+0xdc/0x110
           [<ffffffff81244b9b>] blk_execute_rq+0x9b/0x100
           [<ffffffff8124f665>] sg_io+0x1a5/0x450
           [<ffffffff8124fd21>] scsi_cmd_ioctl+0x2a1/0x430
           [<ffffffff8124fef2>] scsi_cmd_blk_ioctl+0x42/0x50
           [<ffffffffa00ec97e>] sd_ioctl+0xbe/0x140 [sd_mod]
           [<ffffffff8124bd04>] blkdev_ioctl+0x234/0x840
           [<ffffffff811cb491>] block_ioctl+0x41/0x50
           [<ffffffff811a0df0>] do_vfs_ioctl+0x300/0x520
           [<ffffffff811a1051>] SyS_ioctl+0x41/0x80
           [<ffffffff814b9962>] tracesys+0xd0/0xd5
      
      Fix this by scheduling removal work on a different workqueue than the
      one used by the transport layer timers.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Reviewed-by: David Dillow <dave@thedillows.org>
      Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
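The ordering problem described in the IB/srp commit message can be illustrated with a small userspace model. This is only a sketch under strong assumptions: it is not the kernel workqueue API and not the code of the patch. The `wq`, `wq_queue`, `wq_step`, `remove_work`, `fail_io_work` and `simulate_removal` names are hypothetical, and each queue is modelled as a strictly ordered list where a blocked head item prevents anything behind it from running. In the model, removal cannot finish until the timer work has failed the outstanding I/O.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WORK 4

/* Hypothetical shared state: has pending I/O been failed, is the host gone? */
struct state {
    bool io_failed;
    bool host_removed;
};

/* A work item returns true once it has finished, false while it is blocked. */
typedef bool (*work_fn)(struct state *);

/* Toy ordered queue: only the head item may run. */
struct wq {
    work_fn items[MAX_WORK];
    int head;
    int tail;
};

static void wq_queue(struct wq *q, work_fn fn)
{
    assert(q->tail < MAX_WORK);
    q->items[q->tail++] = fn;
}

/* Try to run the head item; report whether the queue made progress. */
static bool wq_step(struct wq *q, struct state *s)
{
    if (q->head == q->tail)
        return false;               /* nothing queued */
    if (!q->items[q->head](s))
        return false;               /* head is blocked, so is everything behind it */
    q->head++;
    return true;
}

/* Models host removal: it waits until outstanding I/O has been failed. */
static bool remove_work(struct state *s)
{
    if (!s->io_failed)
        return false;
    s->host_removed = true;
    return true;
}

/* Models the fast_io_fail/dev_loss timer work: it fails outstanding I/O. */
static bool fail_io_work(struct state *s)
{
    s->io_failed = true;
    return true;
}

/* Returns true if host removal completes, false if the model deadlocks. */
static bool simulate_removal(bool separate_queues)
{
    struct state s = { false, false };
    struct wq timers = { 0 };
    struct wq removal = { 0 };
    bool progress = true;

    /* Removal is queued first; the timer work arrives afterwards. */
    wq_queue(separate_queues ? &removal : &timers, remove_work);
    wq_queue(&timers, fail_io_work);

    while (progress) {
        progress = wq_step(&timers, &s);
        progress = wq_step(&removal, &s) || progress;
    }
    return s.host_removed;
}
```

In this model `simulate_removal(false)` stalls because the timer work sits behind the blocked removal work, while `simulate_removal(true)` completes because the timer work can run on its own queue.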
    • Tejun Heo's avatar
      blkcg: don't call into policy draining if root_blkg is already gone · 77a8689a
      Tejun Heo authored
      commit 2a1b4cf2 upstream.
      
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping the call into policy draining if all the blkgs are
      already gone.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
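The guard described in the blkcg commit message can be sketched in plain userspace C. This is a minimal model, not the kernel source: the `blkg` and `queue` structs, the `queued` field, and the `policy_drain` and `drain_queue` functions are hypothetical stand-ins for the real block-layer types and for blkcg_drain_queue(). It only shows the shape of the fix, which is to return early when root_blkg is NULL instead of letting the policy code dereference it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block cgroup: just a count of queued requests. */
struct blkg {
    int queued;
};

/* Hypothetical stand-in for a request queue; root_blkg is NULL once destroyed. */
struct queue {
    struct blkg *root_blkg;
};

/* Policy-level drain: dereferences root_blkg, so it must never see NULL. */
static int policy_drain(struct queue *q)
{
    int drained;

    assert(q->root_blkg != NULL);
    drained = q->root_blkg->queued;
    q->root_blkg->queued = 0;
    return drained;
}

/* Guarded entry point: skip policy draining if all blkgs are already gone. */
static int drain_queue(struct queue *q)
{
    if (!q->root_blkg)
        return 0;
    return policy_drain(q);
}

/* Drain a live queue holding n requests; returns how many were drained. */
static int demo_live_queue(int n)
{
    struct blkg root = { n };
    struct queue q = { &root };

    return drain_queue(&q);
}

/* Drain a queue whose blkgs were already destroyed; returns 0, no crash. */
static int demo_destroyed_queue(void)
{
    struct queue q = { NULL };

    return drain_queue(&q);
}
```

Without the NULL check in `drain_queue`, the destroyed-queue case would reach `policy_drain` and dereference a NULL pointer, which is the userspace analogue of the oops shown in the commit message.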