1. 18 Mar, 2015 40 commits
    • mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change · 6c749954
      Johannes Weiner authored
      commit cc873177 upstream.
      
      Historically, !__GFP_FS allocations were not allowed to invoke the OOM
      killer once reclaim had failed, but nevertheless kept looping in the
      allocator.
      
      Commit 9879de73 ("mm: page_alloc: embed OOM killing naturally into
      allocation slowpath"), which should have been a simple cleanup patch,
      accidentally changed the behavior to aborting the allocation at that
      point.  This creates problems with filesystem callers (?) that currently
      rely on the allocator waiting for other tasks to intervene.
      
      Revert the behavior as it shouldn't have been changed as part of a
      cleanup patch.
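
      The behavior difference described above can be summarized in a small
      model.  This is a Python sketch rather than kernel code: the function
      name and the returned strings are hypothetical, and only the
      __GFP_FS / !__GFP_FS distinction is taken from the commit message.

```python
def after_reclaim_failed(gfp_fs: bool, reverted: bool) -> str:
    """Model what the allocator slowpath does once reclaim has failed.

    gfp_fs   -- True if the allocation carries __GFP_FS
    reverted -- True for the historical (and restored) behavior,
                False for the behavior introduced by 9879de73
    """
    if gfp_fs:
        # Allocations with __GFP_FS are allowed to invoke the OOM killer.
        return "may_oom_kill"
    if reverted:
        # !__GFP_FS: no OOM kill, but keep looping in the allocator and
        # wait for other tasks to intervene.
        return "retry"
    # Accidental behavior: the allocation is aborted at this point.
    return "fail"
```

      In this model, a !__GFP_FS request yields "fail" with reverted=False
      and "retry" with reverted=True, which is the change this revert makes.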
      
      Fixes: 9879de73 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/nommu: fix memory leak · 51571a01
      Joonsoo Kim authored
      commit da616534 upstream.
      
      Maxime reported the following memory leak regression due to commit
      dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own
      implementation").
      
      On v3.19, I am facing a memory leak.  Each time I run a command, one page
      is lost.  Here is an example with busybox's free command:
      
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1972       5956          0          0        492
        -/+ buffers/cache:       1480       6448
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1976       5952          0          0        492
        -/+ buffers/cache:       1484       6444
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1980       5948          0          0        492
        -/+ buffers/cache:       1488       6440
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1984       5944          0          0        492
        -/+ buffers/cache:       1492       6436
        / # free
                     total       used       free     shared    buffers     cached
        Mem:          7928       1988       5940          0          0        492
        -/+ buffers/cache:       1496       6432
      
      At some point, the system fails to satisfy 256KB allocations:
      
        free: page allocation failure: order:6, mode:0xd0
        CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
        Hardware name: STM32 (Device Tree Support)
          show_stack+0xb/0xc
          warn_alloc_failed+0x97/0xbc
          __alloc_pages_nodemask+0x295/0x35c
          __get_free_pages+0xb/0x24
          alloc_pages_exact+0x19/0x24
          do_mmap_pgoff+0x423/0x658
          vm_mmap_pgoff+0x3f/0x4e
          load_flat_file+0x20d/0x4f8
          load_flat_binary+0x3f/0x26c
          search_binary_handler+0x51/0xe4
          do_execveat_common+0x271/0x35c
          do_execve+0x19/0x1c
          ret_fast_syscall+0x1/0x4a
        Mem-info:
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
         free_cma:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        2048 pages of RAM
        1538 free pages
        66 reserved pages
        109 slab pages
        -46 pages shared
        0 pages swap cached
        nommu: Allocation of length 221184 from process 67 (free) failed
        Normal per-cpu:
        CPU    0: hi:    0, btch:   1 usd:   0
        active_anon:0 inactive_anon:0 isolated_anon:0
         active_file:0 inactive_file:0 isolated_file:0
         unevictable:123 dirty:0 writeback:0 unstable:0
         free:1515 slab_reclaimable:17 slab_unreclaimable:139
         mapped:0 shmem:0 pagetables:0 bounce:0
         free_cma:0
        Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
        lowmem_reserve[]: 0 0
        Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
        123 total pagecache pages
        Unable to allocate RAM for process text/data, errno 12 SEGV
      
      This problem happens because in some cases do_mmap_private() allocates
      a high-order page through __get_free_pages(), while free_page_series()
      later tries to free the individual pages rather than the high-order
      page.  In this case, pages whose refcount does not drop to 0 when they
      are freed individually are not returned to the page allocator, so
      memory leaks.

      To fix the problem, this patch changes __get_free_pages() to
      alloc_pages_exact(), since alloc_pages_exact() returns
      physically contiguous pages in which each page is individually
      refcounted.
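
      The refcount mismatch can be illustrated with a toy model.  This is a
      Python sketch, not kernel code: the Page class and the model_*
      helpers are hypothetical, and the rule "a page goes back to the
      allocator only when its count drops to exactly 0" is a simplifying
      assumption.

```python
class Page:
    """Toy page with a reference count and an 'in allocator' flag."""
    def __init__(self):
        self.refcount = 0
        self.free = True


def model_get_free_pages(order):
    """High-order allocation: only the first page holds a reference."""
    pages = [Page() for _ in range(1 << order)]
    for page in pages:
        page.free = False
    pages[0].refcount = 1
    return pages


def model_alloc_pages_exact(nr_pages):
    """Exact allocation: every page holds its own reference."""
    pages = [Page() for _ in range(nr_pages)]
    for page in pages:
        page.free = False
        page.refcount = 1
    return pages


def model_free_page_series(pages):
    """Drop one reference per page; free a page only if the count hits 0."""
    for page in pages:
        page.refcount -= 1
        if page.refcount == 0:
            page.free = True


def leaked(pages):
    """Number of pages that never made it back to the allocator."""
    return sum(1 for page in pages if not page.free)
```

      Freeing an order-2 allocation from model_get_free_pages() page by
      page leaves three of the four pages unreturned in this model, while
      the same sequence on model_alloc_pages_exact(4) leaves none.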
      
      Fixes: dbc8358c ("mm/nommu: use alloc_pages_exact() rather than its own implementation")
      Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: fix negative nr_isolated counts · 2cd12f3d
      Hugh Dickins authored
      commit ff59909a upstream.
      
      The vmstat interfaces are good at hiding negative counts (at least when
      CONFIG_SMP); but if you peer behind the curtain, you find that
      nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
      more negative: so they can absorb larger and larger numbers of isolated
      pages, yet still appear to be zero.
      
      I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
      but I guess it's there for a good reason, in which case we ought to get
      too_many_isolated() working again.
      
      The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
      putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
      to call acct_isolated() to increment them.
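
      The imbalance can be reproduced with a small counter model.  This is
      a Python sketch under stated assumptions: the class and the
      abort_path() helper are hypothetical, and visible() only mimics the
      "hide negative counts" behavior mentioned above by clamping at zero.

```python
class IsolatedCounter:
    """Toy stand-in for an NR_ISOLATED_* counter."""
    def __init__(self):
        self.nr_isolated = 0

    def acct_isolated(self, nr_pages):
        # Account pages that were isolated from the LRU.
        self.nr_isolated += nr_pages

    def putback(self, nr_pages):
        # Putting pages back always decrements the counter.
        self.nr_isolated -= nr_pages

    def visible(self):
        # Assumed reporting behavior: negative values show up as 0.
        return max(0, self.nr_isolated)


def abort_path(counter, nr_pages, account_first):
    """Model the abort case: pages are put back right after isolation."""
    if account_first:
        counter.acct_isolated(nr_pages)  # the call that was missing
    counter.putback(nr_pages)
    return counter.nr_isolated
```

      Without the accounting call, every abort pushes the raw counter
      further below zero while visible() keeps reporting 0; with it, the
      counter returns to 0 after each abort.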
      
      It is possible that the bug which this patch fixes could cause OOM kills
      when the system still has a lot of reclaimable page cache.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() · 1bab6ee0
      Naoya Horiguchi authored
      commit 9ab3b598 upstream.
      
      A race condition starts to be visible in recent mmotm, where a PG_hwpoison
      flag is set on a migration source page *before* it's back in buddy page
      pool.
      
      This is problematic because no page flag is supposed to be set when
      freeing (see __free_one_page().) So the user-visible effect of this race
      is that it could trigger the BUG_ON() when soft-offlining is called.
      
      The root cause is that we call lru_add_drain_all() to make sure that the
      page is in buddy, but that doesn't work because this function just
      schedules a work item and doesn't wait for its completion.
      drain_all_pages() does the draining directly, so simply dropping
      lru_add_drain_all() solves this problem.
      
      Fixes: f15bdfa8 ("mm/memory-failure.c: fix memory leak in successful soft offlining")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/memory.c: actually remap enough memory · 6f5468a7
      Grazvydas Ignotas authored
      commit 9cb12d7b upstream.
      
      For whatever reason, generic_access_phys() only remaps one page, but
      actually allows accesses of arbitrary size.  It's quite easy to trigger
      large reads, like printing out a large structure with gdb, which leads
      to a crash.  Fix it by remapping the correct size.
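
      As an illustration of what "the correct size" means, the sketch
      below computes how many bytes must be mapped to cover an access that
      starts at some offset inside a page.  It is a Python model, not the
      kernel change itself: PAGE_SIZE of 4096 and both helper names are
      assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this example


def page_align(nbytes):
    """Round nbytes up to the next multiple of PAGE_SIZE."""
    return (nbytes + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def bytes_to_remap(addr, length):
    """Size of a page-aligned mapping that covers [addr, addr + length)."""
    offset = addr & (PAGE_SIZE - 1)  # offset of addr within its page
    return page_align(offset + length)
```

      A 16-byte access that straddles a page boundary already needs two
      pages in this model, so a fixed one-page mapping is too small for
      it, and more so for any multi-page read.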
      
      Fixes: 28b2ee20 ("access_process_vm device memory infrastructure")
      Signed-off-by: Grazvydas Ignotas <notasas@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/compaction: fix wrong order check in compact_finished() · cf4a7969
      Joonsoo Kim authored
      commit 372549c2 upstream.
      
      What we want to check here is whether there is a high-order free page
      in the buddy list of another migratetype, so that it can be stolen
      without fragmentation.  But the current code just checks cc->order,
      which is the order of the allocation request.  So, this is wrong.

      Without this fix, non-movable synchronous compaction below pageblock
      order would not stop until compaction is complete, because the
      migratetype of most pageblocks is movable and the high-order free pages
      made by compaction are usually on the movable-type buddy list.
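
      The difference between the two checks can be sketched as follows.
      This is a simplified Python model of the termination test, not the
      kernel function: the constants, the free_area layout, and the
      compaction_done() name are assumptions made for illustration.

```python
PAGEBLOCK_ORDER = 9  # assumed value for this example
MAX_ORDER = 11       # assumed value for this example


def compaction_done(request_order, migratetype, free_area, fixed=True):
    """Decide whether compaction can stop.

    free_area maps order -> {migratetype: number of free pages}.
    fixed=False compares the request order against the pageblock order,
    fixed=True compares the order of the free list being inspected.
    """
    for order in range(request_order, MAX_ORDER):
        area = free_area.get(order, {})
        # Done if a free page of the right migratetype exists.
        if area.get(migratetype, 0) > 0:
            return True
        # Done if a free page is large enough to be stolen as a block.
        checked_order = order if fixed else request_order
        if checked_order >= PAGEBLOCK_ORDER and sum(area.values()) > 0:
            return True
    return False
```

      With a single free order-9 page on the movable list, an order-7
      unmovable request is considered finished only when fixed=True; the
      old check keeps compaction running in that situation.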
      
      There is a report related to this bug.  See the link below.
      
        http://www.spinics.net/lists/linux-mm/msg81666.html
      
      Although the affected system still sees load spikes from compaction,
      this fix makes that system completely stable and responsive according to
      the reporter.

      The stress-highalloc test in mmtests with non-movable order-7
      allocations doesn't show any notable difference in allocation success
      rate, but it does show a higher compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      18.47 : 28.94
      
      Fixes: 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available")
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() · 225c2a35
      Roman Gushchin authored
      commit 8138a67a upstream.
      
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NEVER mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NEVER mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() a large file multiple times (with VM_SHARED flag)
      3) try to malloc() a large amount of memory

      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
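
      The wrap-around can be shown with a short model.  This is a Python
      sketch, not the kernel code: a 64-bit unsigned long is assumed and
      emulated with a mask, the subtraction with min() follows the newer
      form that includes the user reserve, and all function names and
      sample values are hypothetical.

```python
ULONG_BITS = 64                    # assumed width of unsigned long
ULONG_MASK = (1 << ULONG_BITS) - 1


def allowed_unsigned(allowed, total_vm, reserve):
    """Subtract the per-process share using wrapping unsigned arithmetic."""
    return (allowed - min(total_vm // 32, reserve)) & ULONG_MASK


def allowed_signed(allowed, total_vm, reserve):
    """Same subtraction with signed arithmetic; the result may be negative."""
    return allowed - min(total_vm // 32, reserve)


def enough_memory(committed, allowed):
    """The final comparison: allow the request if committed < allowed."""
    return committed < allowed
```

      With allowed = 1000 pages, total_vm = 64000 pages and a very large
      reserve, the unsigned result wraps to a huge value and any request
      passes the comparison, whereas the signed result is -1000 and the
      request is refused.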
      
      Fix this issue by switching to signed arithmetic here.

      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 00f4f16b
      Roman Gushchin authored
      commit 5703b087 upstream.
      
      I noticed that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NEVER mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NEVER mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() a large file multiple times (with VM_SHARED flag)
      3) try to malloc() a large amount of memory

      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: when stealing freepages, also take pages created by splitting buddy page · cdf47668
      Vlastimil Babka authored
      commit 99592d59 upstream.
      
      When studying page stealing, I noticed some weird looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
      RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
      so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always a clear win and Joonsoo questioned
      the results.  Here I used a different baseline which includes RFC
      compaction improvements from [1].  I found that the compaction
      improvements reduce variability of stress-highalloc, so there's less
      noise in the data.

      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  The
      first column is after a fresh reboot; the other two are reiterations of
      the test without reboot.  That was all accumulated over 5 re-iterations
      (so the benchmark was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we've seen an apparent regression with Patch 1 for unmovable
      events; this is now gone, suggesting it was indeed noise.  Here, each
      patch improves the situation for unmovable events.  Reclaimable is
      improved by patch 1 and then either the same modulo noise, or perhaps
      slightly worse - a small price for unmovable improvements, IMHO.  The
      number of movable allocations falling back to other migratetypes is the
      most noisy, but it's reduced to half at Patch 2 nevertheless.  These are
      the least critical, as compaction can move them around.
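
      As a rough check of the "reduced to half" remark, the ratios can be
      computed directly from the "Extfrag fragmenting for movable" rows of
      the Baseline and Patch 2 tables above.  The Python snippet below is
      only a convenience calculation; the variable names are arbitrary.

```python
# "Extfrag fragmenting for movable", columns nothp-1, nothp-2, nothp-3
baseline_movable = [10249018, 8682096, 10225696]
patch2_movable = [4717997, 3659070, 3798339]

# Fraction of movable fallback events remaining with Patch 2 applied
ratios = [after / before
          for after, before in zip(patch2_movable, baseline_movable)]

for column, ratio in zip(("nothp-1", "nothp-2", "nothp-3"), ratios):
    print(f"{column}: {ratio:.2f} of baseline")
```

      All three ratios come out below 0.5 (roughly 0.46, 0.42 and 0.37),
      consistent with the statement.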
      
      If we look at success rates, the patches don't affect them; that didn't change.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider reduced fragmentation events
      to be worthwhile on their own.  Patch 2 also seems to reduce scanning for
      free pages, and migrations in compaction, suggesting it has somewhat less
      work to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like a THP page fault (i.e. no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given how [1] offers lower improvements
      in this scenario due to fewer restarts after deferred compaction which
      would change the compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for the first iteration are clear; the rest is much
      noisier and can look like a regression for Patch 1.  Anyway, patch 2
      rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, decision for 1) has implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_alloc.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has been shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
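
      The relationship between the three cases can be pictured with a small
      userspace model.  This is only a sketch under stated assumptions:
      enum mtype, struct steal_result and resolve_fallback() are hypothetical
      names invented for illustration, and the two boolean inputs stand in for
      the real order- and migratetype-based decisions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migratetypes. */
enum mtype { MT_UNMOVABLE, MT_RECLAIMABLE, MT_MOVABLE };

/* Which migratetype each affected object ends up with. */
struct steal_result {
    enum mtype free_pages;  /* case 1: free pages in the pageblock */
    enum mtype pageblock;   /* case 2: the pageblock itself */
    enum mtype buddies;     /* case 3: buddies left over from the split */
};

/*
 * Model of the restored rule "1) implies 3)": the split-off buddy pages
 * follow the decision made for the free pages, not the one made for the
 * pageblock.  steal_free and change_block are assumed inputs; this model
 * also assumes case 2 is only reachable once case 1 was chosen.
 */
static struct steal_result resolve_fallback(enum mtype desired,
                                            enum mtype fallback,
                                            bool steal_free,
                                            bool change_block)
{
    struct steal_result r;

    r.free_pages = steal_free ? desired : fallback;
    r.pageblock = (steal_free && change_block) ? desired : fallback;
    r.buddies = r.free_pages;   /* 1) implies 3) */
    return r;
}
```

      In this model an UNMOVABLE request that steals the free pages of a
      MOVABLE pageblock keeps its split-off buddies as UNMOVABLE even when the
      pageblock itself stays MOVABLE.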
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it thus makes sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdf47668
    • Andrey Ryabinin's avatar
      mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? · 89316deb
      Andrey Ryabinin authored
      commit 3cd7645d upstream.
      
      Commit ed4d4902 ("mm, hugetlb: remove hugetlb_zero and
      hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
      leading to out-of-bounds access in proc_doulongvec_minmax().  Use
      '.extra1 = NULL' instead of '.extra1 = &zero'.  Passing NULL is
      equivalent to passing minimal value, which is 0 for unsigned types.
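
      Why a NULL lower bound is the safe choice can be sketched in userspace.
      The helper below is hypothetical (ulong_in_range() is not a kernel
      function); it only mirrors the idea of a handler that reads both bounds
      as unsigned long and treats a NULL bound as "no limit on this side".

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical bounds check in the spirit of a min/max handler for
 * unsigned long values.  Both bounds are dereferenced as unsigned long,
 * so they must point at unsigned long objects; pointing 'min' at an int
 * would read past the end of that int wherever long is wider than int.
 * A NULL bound means there is no limit on that side.
 */
static bool ulong_in_range(unsigned long val,
                           const unsigned long *min,
                           const unsigned long *max)
{
    if (min && val < *min)
        return false;
    if (max && val > *max)
        return false;
    return true;
}
```

      With min == NULL every unsigned value down to 0 is accepted, which is
      the same effective lower bound a correctly typed zero would give.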
      
      Fixes: ed4d4902 ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Suggested-by: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      89316deb
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration entry check in __unmap_hugepage_range · 3ffc797a
      Naoya Horiguchi authored
      commit 9fbc1f63 upstream.
      
      If __unmap_hugepage_range() tries to unmap the address range over which
      hugepage migration is on the way, we get the wrong page because pte_page()
      doesn't work for migration entries.  This patch simply clears the pte for
      migration entries as we do for hwpoison entries.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ffc797a
    • Naoya Horiguchi's avatar
      mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection · 2c90c58c
      Naoya Horiguchi authored
      commit a8bda28d upstream.
      
      There is a race condition between hugepage migration and
      change_protection(), where hugetlb_change_protection() doesn't care about
      migration entries and wrongly overwrites them.  That causes unexpected
      results like kernel crash.  HWPoison entries also can cause the same
      problem.
      
      This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
      function to do proper actions.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c90c58c
    • Naoya Horiguchi's avatar
      mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() · 75809873
      Naoya Horiguchi authored
      commit 0f792cf9 upstream.
      
      When running the test which causes the race as shown in the previous patch,
      we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
      
      This race happens when pte turns into migration entry just after the first
      check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
      To fix this, we need to check pte_present() again after huge_ptep_get().
      
      This patch also reorders taking ptl and doing pte_page(), because
      pte_page() should be done in ptl.  Due to this reordering, we need to use
      trylock_page() in page != pagecache_page case to respect locking order.
      
      Fixes: 66aebce7 ("hugetlb: fix race condition in hugetlb_fault()")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75809873
    • Jiri Pirko's avatar
      team: don't traverse port list using rcu in team_set_mac_address · dde0b1d5
      Jiri Pirko authored
      [ Upstream commit 9215f437 ]
      
      Currently the list is traversed using the rcu variant. That is not correct
      since dev_set_mac_address can be called, which eventually calls
      rtmsg_ifinfo_build_skb, and there the skb allocation can sleep. So fix this
      by removing the rcu usage here.
      
      Fixes: 3d249d4c "net: introduce ethernet teaming device"
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dde0b1d5
    • Lorenzo Colitti's avatar
      net: ping: Return EAFNOSUPPORT when appropriate. · 2391f6b4
      Lorenzo Colitti authored
      [ Upstream commit 9145736d ]
      
      1. For an IPv4 ping socket, ping_check_bind_addr does not check
         the family of the socket address that's passed in. Instead,
         make it behave like inet_bind, which enforces either that the
         address family is AF_INET, or that the family is AF_UNSPEC and
         the address is 0.0.0.0.
      2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
         if the socket family is not AF_INET6. Return EAFNOSUPPORT
         instead, for consistency with inet6_bind.
      3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
         instead of EINVAL if an incorrect socket address structure is
         passed in.
      4. Make IPv6 ping sockets be IPv6-only. The code does not support
         IPv4, and it cannot easily be made to support IPv4 because
         the protocol numbers for ICMP and ICMPv6 are different. This
         makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
         of making the socket unusable.
      
      Among other things, this fixes an oops that can be triggered by:
      
          int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
          struct sockaddr_in6 sin6 = {
              .sin6_family = AF_INET6,
              .sin6_addr = in6addr_any,
          };
          bind(s, (struct sockaddr *) &sin6, sizeof(sin6));
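
      The IPv4 rule from item 1 can be sketched as a standalone check.  This
      is a userspace illustration, not the kernel code: check_v4_bind_addr()
      is a hypothetical name, and returning -EINVAL for a too-short address
      is an assumption that the text above does not state.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Userspace sketch of the IPv4 rule from item 1: accept AF_INET, or
 * AF_UNSPEC when the address is 0.0.0.0; reject every other family
 * with EAFNOSUPPORT.  Returns 0 or a negative errno value.
 */
static int check_v4_bind_addr(const struct sockaddr *uaddr, socklen_t len)
{
    const struct sockaddr_in *sin = (const struct sockaddr_in *)uaddr;

    /* Assumed: an address shorter than sockaddr_in is simply invalid. */
    if (len < sizeof(*sin))
        return -EINVAL;
    if (sin->sin_family == AF_INET)
        return 0;
    if (sin->sin_family == AF_UNSPEC &&
        sin->sin_addr.s_addr == htonl(INADDR_ANY))
        return 0;
    return -EAFNOSUPPORT;
}
```

      Passing the sockaddr_in6 from the reproducer above to this check yields
      -EAFNOSUPPORT because its family field is AF_INET6.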
      
      Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2391f6b4
    • Michal Kubeček's avatar
      udp: only allow UFO for packets from SOCK_DGRAM sockets · 123303d4
      Michal Kubeček authored
      [ Upstream commit acf8dd0a ]
      
      If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
      UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
      CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
      checksum is to be computed on segmentation. However, in this case,
      skb->csum_start and skb->csum_offset are never set as raw socket
      transmit path bypasses udp_send_skb() where they are usually set. As a
      result, driver may access invalid memory when trying to calculate the
      checksum and store the result (as observed in virtio_net driver).
      
      Moreover, the very idea of modifying the userspace provided UDP header
      is IMHO against raw socket semantics (I wasn't able to find a document
      clearly stating this or the opposite, though). And while allowing
      CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
      too intrusive change just to handle a corner case like this. Therefore
      disallowing UFO for packets from sockets other than SOCK_DGRAM seems to
      be the best option.
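
      The gating rule named in the title can be written as a small predicate.
      This is a sketch only: struct tx_ctx and may_use_ufo() are hypothetical
      and reduce the transmit path to the handful of facts discussed above.

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical summary of the facts the transmit path looks at. */
struct tx_ctx {
    int sock_type;          /* SOCK_DGRAM, SOCK_RAW, ... */
    unsigned int length;    /* datagram length in bytes */
    unsigned int mtu;       /* MTU in bytes */
    bool dev_has_ufo;       /* device advertises UFO */
};

/*
 * Sketch of the rule in the title: offload is only considered for an
 * over-MTU datagram on a UFO-capable device when the socket type is
 * SOCK_DGRAM.  Everything else is left to the non-offload path.
 */
static bool may_use_ufo(const struct tx_ctx *c)
{
    return c->length > c->mtu &&
           c->dev_has_ufo &&
           c->sock_type == SOCK_DGRAM;
}
```

      A SOCK_RAW sender with a 3000-byte datagram and a 1500-byte MTU is
      rejected by this predicate, while the same datagram from a SOCK_DGRAM
      socket is accepted.
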
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      123303d4
    • Ben Shelton's avatar
      usb: plusb: Add support for National Instruments host-to-host cable · da122a6b
      Ben Shelton authored
      [ Upstream commit 42c972a1 ]
      
      The National Instruments USB Host-to-Host Cable is based on the Prolific
      PL-25A1 chipset.  Add its VID/PID so the plusb driver will recognize it.
      Signed-off-by: Ben Shelton <ben.shelton@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      da122a6b
    • Eric Dumazet's avatar
      net: do not use rcu in rtnl_dump_ifinfo() · 25ba5bb0
      Eric Dumazet authored
      [ Upstream commit cac5e65e ]
      
      We made a failed attempt in the past to only use rcu in rtnl dump
      operations (commit e67f88dd "net: dont hold rtnl mutex during
      netlink dump callbacks").
      
      Now that dumps are holding RTNL anyway, there is no need to also
      use rcu locking, as it forbids any scheduling ability, like
      GFP_KERNEL allocations that controlling path should use instead
      of GFP_ATOMIC whenever possible.
      
      This should fix the following splat that Cong Wang reported:
      
       [ INFO: suspicious RCU usage. ]
       3.19.0+ #805 Tainted: G        W
      
       include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 0
       2 locks held by ip/771:
        #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
        #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e
      
       stack backtrace:
       CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
        ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
        00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
       Call Trace:
        [<ffffffff81a27457>] dump_stack+0x4c/0x65
        [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
        [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
        [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
        [<ffffffff8109e67d>] __might_sleep+0x78/0x80
        [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
        [<ffffffff817c1383>] alloc_netid+0x61/0x69
        [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
        [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
        [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
        [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
        [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
        [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
        [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
        [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
        [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
        [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
        [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
        [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
        [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
        [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
        [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
        [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
        [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
        [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      25ba5bb0
    • Geert Uytterhoeven's avatar
      sh_eth: Fix lost MAC address on kexec · d97191b0
      Geert Uytterhoeven authored
      [ Upstream commit a14c7d15 ]
      
      Commit 740c7f31 ("sh_eth: Ensure DMA engines are stopped before
      freeing buffers") added a call to sh_eth_reset() to the
      sh_eth_set_ringparam() and sh_eth_close() paths.
      
      However, setting the software reset bit(s) in the EDMR register resets
      the MAC Address Registers to zero. Hence after kexec, the new kernel
      doesn't detect a valid MAC address and assigns a random MAC address,
      breaking DHCP.
      
      Set the MAC address again after the reset in sh_eth_dev_exit() to fix
      this.
      
      Tested on r8a7740/armadillo (GETHER) and r8a7791/koelsch (FAST_RCAR).
      
      Fixes: 740c7f31 ("sh_eth: Ensure DMA engines are stopped before freeing buffers")
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d97191b0
    • Florian Fainelli's avatar
      net: bcmgenet: fix software maintained statistics · 8e0f5ee1
      Florian Fainelli authored
      [ Upstream commit f62ba9c1 ]
      
      Commit 44c8bc3c ("net: bcmgenet: log RX buffer allocation and RX/TX dma
      failures") added a few software maintained statistics using
      BCMGENET_STAT_MIB_RX and BCMGENET_STAT_MIB_TX. These statistics are read from
      the hardware MIB counters, such that bcmgenet_update_mib_counters() was trying
      to read from a non-existing MIB offset for these counters.
      
      Fix this by introducing a special type: BCMGENET_STAT_SOFT, similar to
      BCMGENET_STAT_NETDEV, such that bcmgenet_get_ethtool_stats will read from the
      software mib.
      
      Fixes: 44c8bc3c ("net: bcmgenet: log RX buffer allocation and RX/TX dma failures")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e0f5ee1
    • Jaedon Shin's avatar
      net: bcmgenet: fix throughtput regression · 001f8cee
      Jaedon Shin authored
      [ Upstream commit 4092e6ac ]
      
      This patch adds bcmgenet_tx_poll for the tx_rings. This can reduce the
      interrupt load and lets the network stack transmit on time. It also
      separates the completion of tx_ring16 from bcmgenet_poll.

      The interrupt-driven bcmgenet_tx_reclaim of tx_ring[{0,1,2,3}] handles
      no more than a certain number of TxBDs, so transmitted skbs are
      reclaimed too slowly. This caused an xmit performance degradation
      after 605ad7f1 ("tcp: refine TSO autosizing").
      Signed-off-by: Jaedon Shin <jaedon.shin@gmail.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      001f8cee
    • Eric Dumazet's avatar
      macvtap: make sure neighbour code can push ethernet header · 72e72674
      Eric Dumazet authored
      [ Upstream commit 2f1d8b9e ]
      
      Brian reported crashes using IPv6 traffic with macvtap/veth combo.
      
      I tracked the crashes in neigh_hh_output()
      
      -> memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);
      
      Neighbour code assumes headroom to push Ethernet header is
      at least 16 bytes.
      
      It appears macvtap has only 14 bytes available on arches
      where NET_IP_ALIGN is 0 (like x86)
      
      Effect is a corruption of 2 bytes right before skb->head,
      and possible crashes if accessing non existing memory.
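
      The 2-byte figure follows directly from the two sizes quoted above, as
      this small arithmetic sketch shows.  The macro and function names are
      illustrative, not kernel identifiers.

```c
#include <stddef.h>

/* Values taken from the description above; names are illustrative. */
#define HH_COPY_LEN   16u   /* bytes the cached-header memcpy writes */
#define MACVTAP_ROOM  14u   /* headroom available when NET_IP_ALIGN is 0 */

/*
 * Number of bytes that a copy of 'copy_len' bytes ending at skb->data
 * would write in front of the buffer start, given 'headroom' bytes of
 * space between the buffer start and skb->data.
 */
static size_t bytes_before_head(size_t headroom, size_t copy_len)
{
    return copy_len > headroom ? copy_len - headroom : 0;
}
```

      bytes_before_head(MACVTAP_ROOM, HH_COPY_LEN) evaluates to 2, and the
      overrun disappears once the headroom is at least 16 bytes.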
      
      This fix should also increase IPv4 performance, as paranoid code
      in ip_finish_output2() wont have to call skb_realloc_headroom()
      Reported-by: Brian Rak <brak@vultr.com>
      Tested-by: Brian Rak <brak@vultr.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      72e72674
    • Catalin Marinas's avatar
      net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg · c9a44034
      Catalin Marinas authored
      [ Upstream commit d720d8ce ]
      
      With commit a7526eb5 (net: Unbreak compat_sys_{send,recv}msg), the
      MSG_CMSG_COMPAT flag is blocked at the compat syscall entry points,
      changing the kernel compat behaviour from the one before the commit it
      was trying to fix (1be374a0, net: Block MSG_CMSG_COMPAT in
      send(m)msg and recv(m)msg).
      
      On 32-bit kernels (!CONFIG_COMPAT), MSG_CMSG_COMPAT is 0 and the native
      32-bit sys_sendmsg() allows flag 0x80000000 to be set (it is ignored by
      the kernel). However, on a 64-bit kernel, the compat ABI is different
      with commit a7526eb5.
      
      This patch changes the compat_sys_{send,recv}msg behaviour to the one
      prior to commit 1be374a0.
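
      The difference between the two entry-point behaviours can be modelled in
      a few lines.  This is a hedged sketch: COMPAT_BIT and both functions are
      hypothetical names, and the assumption that the restored behaviour sets
      the bit unconditionally before calling the common code is mine, not a
      quote from the patch.

```c
#include <errno.h>

/* Bit value quoted in the text above; the macro name is illustrative. */
#define COMPAT_BIT 0x80000000u

/* Behaviour being reverted: a user-supplied compat bit is an error. */
static int compat_flags_blocking(unsigned int user_flags, unsigned int *out)
{
    if (user_flags & COMPAT_BIT)
        return -EINVAL;
    *out = user_flags | COMPAT_BIT;
    return 0;
}

/*
 * Behaviour being restored (assumed model): whatever the user put in
 * that bit is irrelevant, because the entry point sets it anyway.
 */
static int compat_flags_ignoring(unsigned int user_flags, unsigned int *out)
{
    *out = user_flags | COMPAT_BIT;
    return 0;
}
```

      With flags of 0xffffffff, as in the LTP case described below, the first
      variant fails with -EINVAL while the second one proceeds.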
      
      The problem was found running 32-bit LTP (sendmsg01) binary on an arm64
      kernel. Arguably, LTP should not pass 0xffffffff as flags to sendmsg()
      but the general rule is not to break user ABI (even when the user
      behaviour is not entirely sane).
      
      Fixes: a7526eb5 (net: Unbreak compat_sys_{send,recv}msg)
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9a44034
    • Jiri Pirko's avatar
      team: fix possible null pointer dereference in team_handle_frame · 82b92857
      Jiri Pirko authored
      [ Upstream commit 57e59563 ]
      
      Currently the following race is possible in team:
      
      CPU0                                        CPU1
                                                  team_port_del
                                                    team_upper_dev_unlink
                                                      priv_flags &= ~IFF_TEAM_PORT
      team_handle_frame
        team_port_get_rcu
          team_port_exists
            priv_flags & IFF_TEAM_PORT == 0
          return NULL (instead of port got
                       from rx_handler_data)
                                                    netdev_rx_handler_unregister
      
      The thing is that the flag is removed before rx_handler is unregistered.
      If team_handle_frame is called in between, team_port_exists returns 0
      and team_port_get_rcu will return NULL.
      So do not check the flag here. It is guaranteed by netdev_rx_handler_unregister
      that team_handle_frame will always see valid rx_handler_data pointer.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      82b92857
    • Eric Dumazet's avatar
      net: pktgen: disable xmit_clone on virtual devices · 37c63bf5
      Eric Dumazet authored
      [ Upstream commit 52d6c8c6 ]
      
      Trying to use burst capability (aka xmit_more) on a virtual device
      like bonding is not supported.
      
      For example, skb might be queued multiple times on a qdisc, with
      various list corruptions.
      
      Fixes: 38b2cf29 ("net: pktgen: packet bursting via skb->xmit_more")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      37c63bf5
    • David S. Miller's avatar
      Revert "r8169: add support for Byte Queue Limits" · 9ae56d8b
      David S. Miller authored
      This reverts commit 1e918876.
      
      Revert BQL support in r8169 driver as several regressions
      point to this commit and we cannot figure out the real
      cause yet.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ae56d8b
    • Matthew Thode's avatar
      net: reject creation of netdev names with colons · 465146d2
      Matthew Thode authored
      [ Upstream commit a4176a93 ]
      
      Colons are used as a separator in netdev device lookup in dev_ioctl.c.
      
      The specific ioctls are SIOCGIFTXQLEN, SIOCETHTOOL and SIOCSIFNAME.
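
      A name check of this kind might look like the sketch below.  Only the
      colon rule comes from the text above; netdev_name_ok(), the 16-byte
      NAME_BUF_LEN limit and the additional rejection of empty names, '/'
      and whitespace are assumptions made to keep the example realistic.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define NAME_BUF_LEN 16  /* assumed buffer size, including the NUL */

/*
 * Hypothetical validity check for a device name.  A colon is rejected
 * because lookups treat it as a separator; the remaining rules are
 * assumptions for illustration.
 */
static bool netdev_name_ok(const char *name)
{
    if (*name == '\0')
        return false;
    if (strlen(name) >= NAME_BUF_LEN)
        return false;
    for (; *name; name++) {
        if (*name == ':' || *name == '/' ||
            isspace((unsigned char)*name))
            return false;
    }
    return true;
}
```

      Under these rules "eth0" is accepted and "eth0:1" is rejected.
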
      Signed-off-by: Matthew Thode <mthode@mthode.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      465146d2
    • Eric Dumazet's avatar
      sock: sock_dequeue_err_skb() needs hard irq safety · 9d0ba3cf
      Eric Dumazet authored
      [ Upstream commit 997d5c3f ]
      
      Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
      from hard IRQ context.
      
      Therefore, sock_dequeue_err_skb() needs to block hard irq or
      corruptions or hangs can happen.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: 364a9e93 ("sock: deduplicate errqueue dequeue")
      Fixes: cb820f8e ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d0ba3cf
    • Pravin B Shelar's avatar
      openvswitch: Fix net exit. · 488da593
      Pravin B Shelar authored
      [ Upstream commit 7b4577a9 ]
      
      Open vSwitch allows moving an internal vport to a different namespace
      while it is still connected to the bridge. But when the namespace is
      deleted, OVS does not detach these vports, which leaves a dangling
      pointer to the netdevice and causes the kernel panic shown below.
      This issue is fixed by detaching all OVS ports from the deleted
      namespace at net-exit.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      IP: [<ffffffffa0aadaa5>] ovs_vport_locate+0x35/0x80 [openvswitch]
      Oops: 0000 [#1] SMP
      Call Trace:
       [<ffffffffa0aa6391>] lookup_vport+0x21/0xd0 [openvswitch]
       [<ffffffffa0aa65f9>] ovs_vport_cmd_get+0x59/0xf0 [openvswitch]
       [<ffffffff8167e07c>] genl_family_rcv_msg+0x1bc/0x3e0
       [<ffffffff8167e319>] genl_rcv_msg+0x79/0xc0
       [<ffffffff8167d919>] netlink_rcv_skb+0xb9/0xe0
       [<ffffffff8167deac>] genl_rcv+0x2c/0x40
       [<ffffffff8167cffd>] netlink_unicast+0x12d/0x1c0
       [<ffffffff8167d3da>] netlink_sendmsg+0x34a/0x6b0
       [<ffffffff8162e140>] sock_sendmsg+0xa0/0xe0
       [<ffffffff8162e5e8>] ___sys_sendmsg+0x408/0x420
       [<ffffffff8162f541>] __sys_sendmsg+0x51/0x90
       [<ffffffff8162f592>] SyS_sendmsg+0x12/0x20
       [<ffffffff81764ee9>] system_call_fastpath+0x12/0x17
      Reported-by: Assaf Muller <amuller@redhat.com>
      Fixes: 46df7b81 ("openvswitch: Add support for network namespaces.")
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Reviewed-by: Thomas Graf <tgraf@noironetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      488da593
    • Ignacy Gawędzki's avatar
      ematch: Fix auto-loading of ematch modules. · 7f77f6c8
      Ignacy Gawędzki authored
      [ Upstream commit 34eea79e ]
      
      In tcf_em_validate(), after calling request_module() to load the
      kind-specific module, set em->ops to NULL before returning -EAGAIN, so
      that module_put() is not called again by tcf_em_tree_destroy().
      Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f77f6c8
    • Guenter Roeck's avatar
      net: phy: Fix verification of EEE support in phy_init_eee · 1dd8e324
      Guenter Roeck authored
      [ Upstream commit 54da5a8b ]
      
      phy_init_eee uses phy_find_setting(phydev->speed, phydev->duplex)
      to find a valid entry in the settings array for the given speed
      and duplex value. For full duplex 1000baseT, this will return
      the first matching entry, which is the entry for 1000baseKX_Full.
      
      If the phy eee does not support 1000baseKX_Full, this entry will not
      match, causing phy_init_eee to fail for no good reason.
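
      The failure mode is easy to reproduce with a toy settings table.  The
      sketch below is not the driver code: the table, the bit assignments and
      both lookup functions are hypothetical, and the "any match" variant is
      only one plausible way to avoid the problem, not necessarily what the
      upstream fix does.

```c
#include <stdbool.h>
#include <stddef.h>

struct setting { int speed; int duplex; unsigned int mode_bit; };

/* Toy table: two modes share speed 1000 / full duplex. */
static const struct setting settings[] = {
    { 1000, 1, 1u << 0 },   /* stands in for 1000baseKX_Full */
    { 1000, 1, 1u << 1 },   /* stands in for 1000baseT_Full */
    {  100, 1, 1u << 2 },
};
#define N_SETTINGS (sizeof(settings) / sizeof(settings[0]))

/* First-match lookup: only the first entry for speed/duplex is tested. */
static bool first_match_supported(int speed, int duplex, unsigned int mask)
{
    for (size_t i = 0; i < N_SETTINGS; i++) {
        if (settings[i].speed == speed && settings[i].duplex == duplex)
            return (settings[i].mode_bit & mask) != 0;
    }
    return false;
}

/* Any-match lookup: every entry for speed/duplex is tested. */
static bool any_match_supported(int speed, int duplex, unsigned int mask)
{
    for (size_t i = 0; i < N_SETTINGS; i++) {
        if (settings[i].speed == speed && settings[i].duplex == duplex &&
            (settings[i].mode_bit & mask))
            return true;
    }
    return false;
}
```

      With a mask that contains only the 1000baseT_Full stand-in, the
      first-match lookup reports "unsupported" while the any-match lookup
      finds the second entry.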
      
      Fixes: 9a9c56cb ("net: phy: fix a bug when verify the EEE support")
      Fixes: 3e707706 ("phy: Expand phy speed/duplex settings array")
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1dd8e324
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should not assume that skb_network_offset is zero · 8959499c
      Alexander Drozdov authored
      [ Upstream commit 3e32e733 ]
      
      ip_check_defrag() may be used by af_packet to defragment outgoing packets.
      skb_network_offset() of af_packet's outgoing packets is not zero.
      Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8959499c
    • Alexander Drozdov's avatar
      ipv4: ip_check_defrag should correctly check return value of skb_copy_bits · 1ccd26c9
      Alexander Drozdov authored
      [ Upstream commit fba04a9e ]
      
      skb_copy_bits() returns zero on success and a negative value on error,
      so the condition in ip_check_defrag() needs to be inverted.
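
As a rough illustration of this return convention, the sketch below uses a hypothetical user-space helper, copy_bits(), that follows the same rule (zero on success, negative errno on failure). It is not the kernel function, and read_header() is an invented caller. The point is that the caller must treat a negative result as failure rather than testing for a non-zero "success".

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for a bounds-checked copy helper:
 * returns 0 on success, a negative errno value on failure.
 */
static int copy_bits(const unsigned char *buf, size_t buf_len,
                     size_t offset, void *to, size_t len)
{
    assert(buf != NULL && to != NULL);
    if (offset > buf_len || len > buf_len - offset)
        return -EFAULT;
    memcpy(to, buf + offset, len);
    return 0;
}

/* Returns 1 if the 4-byte header was read, 0 otherwise. */
static int read_header(const unsigned char *pkt, size_t pkt_len,
                       unsigned char hdr[4])
{
    /* Correct test: a negative value means the copy failed. */
    if (copy_bits(pkt, pkt_len, 0, hdr, 4) < 0)
        return 0;
    return 1;
}
```

Writing the test as `if (!copy_bits(...)) return 0;` would bail out on every successful copy, which is the kind of inverted condition the commit message describes.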
      
      Fixes: 1bf3751e ("ipv4: ip_check_defrag must not modify skb before unsharing")
      Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ccd26c9
    • Ignacy Gawędzki's avatar
      gen_stats.c: Duplicate xstats buffer for later use · e67e4522
      Ignacy Gawędzki authored
      [ Upstream commit 1c4cff0c ]
      
      The gnet_stats_copy_app() function is called, more often than not, with
      its second argument pointing to an automatic variable on the caller's
      stack.  Therefore, to avoid copying garbage afterwards when calling
      gnet_stats_finish_copy(), this data is better copied into dynamically
      allocated memory that is freed after use.
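
The idea of duplicating a caller's stack buffer for later use can be sketched in user-space C as follows. The struct and function names (struct dump, dump_copy_app(), dump_finish()) are hypothetical and only mirror the shape of the problem; the kernel code uses its own types and allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dump context that keeps app stats until the final copy. */
struct dump {
    void *xstats;
    size_t xstats_len;
};

/* Duplicate the caller's buffer instead of keeping a pointer to it. */
static int dump_copy_app(struct dump *d, const void *st, size_t len)
{
    assert(d != NULL);
    free(d->xstats);
    d->xstats = NULL;
    d->xstats_len = 0;
    if (st == NULL || len == 0)
        return 0;
    d->xstats = malloc(len);
    if (d->xstats == NULL)
        return -1;
    memcpy(d->xstats, st, len);
    d->xstats_len = len;
    return 0;
}

/* A caller whose stats live in an automatic variable. */
static int fill_from_stack(struct dump *d)
{
    unsigned int local[2] = { 7, 9 };
    return dump_copy_app(d, local, sizeof(local));
}

/* Emit the saved stats, then release the private copy. */
static size_t dump_finish(struct dump *d, void *out, size_t out_cap)
{
    size_t n = d->xstats_len < out_cap ? d->xstats_len : out_cap;
    if (n)
        memcpy(out, d->xstats, n);
    free(d->xstats);
    d->xstats = NULL;
    d->xstats_len = 0;
    return n;
}
```

Because the context owns a private copy, dump_finish() still sees valid data after fill_from_stack() has returned and its local array has gone out of scope.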
      
      [xiyou.wangcong@gmail.com: remove a useless kfree()]
      Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e67e4522
    • WANG Cong's avatar
      rtnetlink: call ->dellink on failure when ->newlink exists · 2f3b6173
      WANG Cong authored
      [ Upstream commit 7afb8886 ]
      
      Ignacy reported that when eth0 is down and we add a vlan device
      on top of it like:
      
        ip link add link eth0 name eth0.1 up type vlan id 1
      
      We will get a refcount leak:
      
        unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2
      
      The problem is that when rtnl_configure_link() fails in rtnl_newlink(),
      we simply call unregister_netdevice(), but for a stacked device like
      vlan, we do almost nothing when we unregister the upper device; more
      work is done when we unregister the lower device, so call its ->dellink().
      
      Reported-by: Ignacy Gawedzki <ignacy.gawedzki@green-communications.fr>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f3b6173
    • Martin KaFai Lau's avatar
      ipv6: fix ipv6_cow_metrics for non DST_HOST case · 29ec7670
      Martin KaFai Lau authored
      [ Upstream commit 3b471175 ]
      
      ipv6_cow_metrics() currently assumes only DST_HOST routes require
      dynamic metrics allocation from inetpeer.  The assumption breaks
      when ndisc discovers a router that advertises RTAX_MTU and
      RTAX_HOPLIMIT metrics.  Refer to ndisc_router_discovery() in ndisc.c
      and note that dst_metric_set() is called after the route is created.
      
      This patch creates the metrics array (by calling dst_cow_metrics_generic) in
      ipv6_cow_metrics().
      
      Test:
      radvd.conf:
      interface qemubr0
      {
      	AdvLinkMTU 1300;
      	AdvCurHopLimit 30;
      
      	prefix fd00:face:face:face::/64
      	{
      		AdvOnLink on;
      		AdvAutonomous on;
      		AdvRouterAddr off;
      	};
      };
      
      Before:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
      fe80::/64 dev eth0  proto kernel  metric 256
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec
      
      After:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
      fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30
      
      Fixes: 8e2ec639 (ipv6: don't use inetpeer to store metrics for routes.)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      29ec7670
    • Eric Dumazet's avatar
      tcp: make sure skb is not shared before using skb_get() · b73f6e47
      Eric Dumazet authored
      [ Upstream commit ba34e6d9 ]
      
      IPv6 can keep a copy of the SYN message using skb_get() in
      tcp_v6_conn_request() so that the caller won't free the skb when
      calling kfree_skb() later.
      
      Therefore TCP fast open has to clone the skb it is queuing in
      child->sk_receive_queue, as all skbs consumed from receive_queue are
      freed using __kfree_skb() (i.e. assuming skb->users == 1).
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Fixes: 5b7ed089 ("tcp: move fastopen functions to tcp_fastopen.c")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b73f6e47
    • Vlad Yasevich's avatar
      ipv6: Make __ipv6_select_ident static · 9519ba74
      Vlad Yasevich authored
      [ Upstream commit 8381eacf ]
      
      Make __ipv6_select_ident() static as it isn't used outside
      the file.
      
      Fixes: 0508c07f (ipv6: Select fragment id during UFO segmentation if not set.)
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9519ba74
    • Vlad Yasevich's avatar
      ipv6: Fix fragment id assignment on LE arches. · aee61469
      Vlad Yasevich authored
      [ Upstream commit 51f30770 ]
      
      Recent commit:
      0508c07f
      Author: Vlad Yasevich <vyasevich@gmail.com>
      Date:   Tue Feb 3 16:36:15 2015 -0500
      
          ipv6: Select fragment id during UFO segmentation if not set.
      
      Introduced a bug on LE in how the ipv6 fragment id is assigned.
      This was caught by the nightly sparse check.
      
      Resolve the following sparse error:
       net/ipv6/output_core.c:57:38: sparse: incorrect type in assignment (different base types)
         net/ipv6/output_core.c:57:38:    expected restricted __be32 [usertype] ip6_frag_id
         net/ipv6/output_core.c:57:38:    got unsigned int [unsigned] [assigned] [usertype] id
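
The sparse warning says a host-order integer is stored into a field that is expected to be big-endian. The user-space sketch below assumes the remedy is a host-to-network conversion with htonl(); the commit message itself does not show the fix. The struct and helper names are hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment header; identification is big-endian on the wire. */
struct frag_hdr_sketch {
    uint8_t  nexthdr;
    uint8_t  reserved;
    uint16_t frag_off;
    uint32_t identification;
};

/* Store a host-order id in network byte order. */
static void set_frag_id(struct frag_hdr_sketch *fh, uint32_t id)
{
    assert(fh != NULL);
    fh->identification = htonl(id);
}

/* Return byte i of the identification field as it sits in memory. */
static uint8_t frag_id_wire_byte(const struct frag_hdr_sketch *fh, size_t i)
{
    uint8_t raw[4];
    assert(fh != NULL && i < sizeof(raw));
    memcpy(raw, &fh->identification, sizeof(raw));
    return raw[i];
}
```

On a big-endian host htonl() is a no-op, which is why a missing conversion only shows up as wrong behaviour on little-endian machines, while sparse flags the type mismatch on every architecture.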
      
      Fixes: 0508c07f (ipv6: Select fragment id during UFO segmentation if not set.)
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aee61469
    • Daniel Borkmann's avatar
      rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 4f630d42
      Daniel Borkmann authored
      [ Upstream commit 364d5716 ]
      
      ifla_vf_policy[] is wrong in advertising its individual member types as
      NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
      len member as *max* attribute length [0, len].
      
      The issue is that when do_setvfinfo() is being called to set up a VF
      through ndo handler, we could set corrupted data if the attribute length
      is less than the size of the related structure itself.
      
      The intent is exactly the opposite, namely to make sure that at least
      len bytes of data are passed.
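
The difference between a maximum-length and a minimum-length policy can be shown with a small stand-alone sketch. The enum, the policy struct, and the payload struct below are hypothetical and do not reproduce the netlink attribute policy API; they only model the two length rules discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical length rules for validating an attribute payload. */
enum len_rule {
    LEN_AT_MOST,   /* payload may be anywhere in [0, len] */
    LEN_AT_LEAST   /* payload must carry at least len bytes */
};

struct attr_policy {
    enum len_rule rule;
    size_t len;
};

/* Hypothetical fixed-size structure the handler reads from the payload. */
struct vf_mac_sketch {
    unsigned int vf;
    unsigned char mac[32];
};

/* Returns 1 if attr_len satisfies the policy, 0 otherwise. */
static int attr_len_ok(const struct attr_policy *p, size_t attr_len)
{
    assert(p != NULL);
    if (p->rule == LEN_AT_MOST)
        return attr_len <= p->len;
    return attr_len >= p->len;
}
```

Under the at-most rule a 4-byte payload passes validation even though the handler would go on to read sizeof(struct vf_mac_sketch) bytes; under the at-least rule the same payload is rejected before the handler ever sees it.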
      
      Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
      Cc: Mitch Williams <mitch.a.williams@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f630d42