1. 26 Jul, 2016 40 commits
    • Minchan Kim's avatar
      mm: use put_page() to free page instead of putback_lru_page() · c6c919eb
      Minchan Kim authored
      Recently, I got many reports about performance degradation in embedded
      systems (Android mobile phones, webOS TVs and so on) and easy fork
      failures.
      
      The problem was fragmentation caused mainly by zram and the GPU driver.
      Under memory pressure, their pages were spread out across all pageblocks
      and could not be migrated by the current compaction algorithm, which
      supports only LRU pages.  In the end, compaction cannot work well, so the
      reclaimer shrinks all of the working set pages.  That made the system
      very slow and even made fork, which requires order-[2 or 3] allocations,
      fail easily.
      
      Another pain point is that they cannot use CMA memory space, so when an
      OOM kill happens, I can see many free pages in the CMA area, which is not
      memory efficient.  In our product, which has a big CMA memory, it
      reclaims zones too excessively to allocate GPU and zram pages although
      there is lots of free space in CMA, so the system becomes very slow
      easily.
      
      To solve these problems, this patch tries to add a facility to migrate
      non-lru pages by introducing new functions and page flags to help
      migration.
      
      struct address_space_operations {
      	..
      	..
      	bool (*isolate_page)(struct page *, isolate_mode_t);
      	void (*putback_page)(struct page *);
      	..
      }
      
      new page flags
      
      	PG_movable
      	PG_isolated
      
      For details, please read description in "mm: migrate: support non-lru
      movable page migration".
      
      Originally, Gioh Kim had tried to support this feature but he moved on,
      so I took over the work.  I took much code from his work and changed it a
      little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves much
      credit, too.
      
      And I should mention Chulmin, who has tested this patchset heavily, so I
      could find many bugs thanks to him.  :)
      
      Thanks, Gioh, Konstantin and Chulmin!
      
      This patchset consists of five parts.
      
      1. clean up migration
        mm: use put_page to free page instead of putback_lru_page
      
      2. add non-lru page migration feature
        mm: migrate: support non-lru movable page migration
      
      3. rework KVM memory-ballooning
        mm: balloon: use general non-lru movable page feature
      
      4. zsmalloc refactoring for preparing page migration
        zsmalloc: keep max_object in size_class
        zsmalloc: use bit_spin_lock
        zsmalloc: use accessor
        zsmalloc: factor page chain functionality out
        zsmalloc: introduce zspage structure
        zsmalloc: separate free_zspage from putback_zspage
        zsmalloc: use freeobj for index
      
      5. zsmalloc page migration
        zsmalloc: page migration support
        zram: use __GFP_MOVABLE for memory allocation
      
      This patch (of 12):
      
      Procedure of page migration is as follows:
      
      First of all, it should isolate a page from LRU and try to migrate the
      page.  If it is successful, it releases the page for freeing.
      Otherwise, it should put the page back to LRU list.
      
      For LRU pages, we have used putback_lru_page for both freeing and
      putback to the LRU list.  It's okay because put_page is aware of the LRU
      list, so if it releases the last refcount of the page, it removes the
      page from the LRU list.  However, it performs unnecessary operations
      (e.g., lru_cache_add, pagevec and flags operations; they would not be
      significant, but they are not worth doing) and makes it harder to support
      new non-lru page migration because put_page isn't aware of a non-lru
      page's data structure.
      
      To solve the problem, we could add a new hook in put_page with a
      PageMovable flag check, but that can increase overhead in the hot path
      and needs a new locking scheme to stabilize the flag check with put_page.
      
      So, this patch cleans it up by dividing the two semantics (i.e., put and
      putback).  If migration is successful, use put_page instead of
      putback_lru_page, and use putback_lru_page only on failure.  That makes
      the code more readable and doesn't add overhead in put_page.
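
The put/putback split described above can be illustrated with a small userspace sketch. This is not the kernel code: `struct toy_page` and every `toy_*` function are hypothetical stand-ins, and the model tracks only a reference count and an on-LRU flag.

```c
#include <stdbool.h>

/* Toy stand-in for struct page: only the fields this sketch needs. */
struct toy_page {
	int refcount;	/* references currently held on the page */
	bool on_lru;	/* true while the page sits on the toy LRU list */
};

/* Drop one reference (hypothetical analogue of put_page). */
static void toy_put_page(struct toy_page *page)
{
	if (page->refcount > 0)
		page->refcount--;
}

/* Return the page to the LRU list, then drop the isolation reference. */
static void toy_putback_lru_page(struct toy_page *page)
{
	page->on_lru = true;
	toy_put_page(page);
}

/* Isolation removes the page from the LRU list and takes a reference. */
static void toy_isolate(struct toy_page *page)
{
	page->on_lru = false;
	page->refcount++;
}

/*
 * Finish a migration attempt on an isolated page:
 *   success -> plain put, no LRU work at all
 *   failure -> putback, so the page returns to the LRU list
 */
static void toy_finish_migration(struct toy_page *old_page, bool migrated)
{
	if (migrated)
		toy_put_page(old_page);
	else
		toy_putback_lru_page(old_page);
}
```

In this model the success path never touches the LRU flag, which mirrors the point that a successful migration needs only a reference drop.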
      
      Comment from Vlastimil
       "Yeah, and compaction (perhaps also other migration users) has to drain
        the lru pvec...  Getting rid of this stuff is worth even by itself."
      
      Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6c919eb
    • Sergey Senozhatsky's avatar
      zram: drop gfp_t from zcomp_strm_alloc() · 16d37725
      Sergey Senozhatsky authored
      We now allocate streams from the CPU_UP hot-plug path; there are no
      context-dependent stream allocations anymore, so we can schedule from
      zcomp_strm_alloc().  Use GFP_KERNEL directly and drop the gfp_t parameter.
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16d37725
    • Sergey Senozhatsky's avatar
      zram: add more compression algorithms · eb9f56d8
      Sergey Senozhatsky authored
      Add "deflate", "lz4hc", "842" algorithms to the list of known
      compression backends.  The real availability of those algorithms,
      however, depends on the corresponding CONFIG_CRYPTO_FOO config options.
      
      [sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb9f56d8
    • Sergey Senozhatsky's avatar
      zram: delete custom lzo/lz4 · ce1ed9f9
      Sergey Senozhatsky authored
      Remove the lzo/lz4 backends; we use the crypto API now.
      
      [sergey.senozhatsky@gmail.com: zram-delete-custom-lzo-lz4-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-6-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-7-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce1ed9f9
    • Sergey Senozhatsky's avatar
      zram: cosmetic: cleanup documentation · 69a30a8d
      Sergey Senozhatsky authored
      zram documentation is a mix of different styles: spaces, tabs, tabs +
      spaces, etc.  Clean it up.
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-6-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69a30a8d
    • Sergey Senozhatsky's avatar
      zram: use crypto api to check alg availability · 415403be
      Sergey Senozhatsky authored
      There is no way to get a string with all the crypto comp algorithms
      supported by the crypto comp engine, so we need to maintain our own
      backends list.  At the same time we additionally need to use
      crypto_has_comp() to make sure that the user has requested a compression
      algorithm that is recognized by the crypto comp engine.  Relying on
      /proc/crypto is not an option here, because it does not show
      not-yet-inserted compression modules.
      
      Example:
      
       modprobe zram
       cat /proc/crypto | grep -i lz4
       modprobe lz4
       cat /proc/crypto | grep -i lz4
      name         : lz4
      driver       : lz4-generic
      module       : lz4
      
      So the user can't tell exactly if the lz4 is really supported from
      /proc/crypto output, unless someone or something has loaded it.
      
      This patch also adds crypto_has_comp() to zcomp_available_show().  We
      store all the compression algorithm names in zcomp's `backends' array,
      regardless of the CONFIG_CRYPTO_FOO configuration, but show only those
      that are also supported by the crypto engine.  This helps the user to
      know the exact list of compression algorithms that can be used.
      
      Example:
        module lz4 is not loaded yet, but is supported by the crypto
        engine. /proc/crypto has no information on this module, while
        zram's `comp_algorithm' lists it:
      
       cat /proc/crypto | grep -i lz4
      
       cat /sys/block/zram0/comp_algorithm
      [lzo] lz4 deflate lz4hc 842
      
      We still use the `backends' array to determine if the requested
      compression backend is known to crypto api.  This array, however, may not
      contain some entries, therefore as the last step we call crypto_has_comp()
      function which attempts to insmod the requested compression algorithm to
      determine if crypto api supports it.  The advantage of this method is that
      now we permit the usage of out-of-tree crypto compression modules
      (implementing S/W or H/W compression).
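
The two checks described in this message can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_probe()` is a hypothetical stand-in for the kernel's crypto_has_comp(), "toyzip" is an invented out-of-tree algorithm name, and the helper names are not taken from the zram sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Names the driver knows about, regardless of build configuration. */
static const char *const known_backends[] = {
	"lzo", "lz4", "deflate", "lz4hc", "842", NULL,
};

/* Hypothetical probe type standing in for crypto_has_comp(). */
typedef bool (*probe_fn)(const char *name);

/* Toy engine: pretends only these algorithms can be provided. */
static bool toy_probe(const char *name)
{
	return strcmp(name, "lzo") == 0 || strcmp(name, "lz4") == 0 ||
	       strcmp(name, "toyzip") == 0;
}

/* A known name is accepted; an unknown one gets a last-step probe. */
static bool algorithm_available(const char *name, probe_fn probe)
{
	for (size_t i = 0; known_backends[i]; i++)
		if (strcmp(known_backends[i], name) == 0)
			return true;
	return probe(name);
}

/* List only the known names that the engine also supports. */
static size_t list_available(char *buf, size_t size, probe_fn probe)
{
	size_t used = 0;

	if (size == 0)
		return 0;
	buf[0] = '\0';
	for (size_t i = 0; known_backends[i]; i++) {
		if (!probe(known_backends[i]))
			continue;
		int n = snprintf(buf + used, size - used, "%s%s",
				 used ? " " : "", known_backends[i]);
		if (n < 0 || (size_t)n >= size - used) {
			buf[used] = '\0';	/* drop a truncated entry */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

With the toy probe, the listing contains only "lzo lz4", while "toyzip" is still accepted on request because the final probe succeeds. That is the property that permits out-of-tree modules.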
      
      [sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415403be
    • Sergey Senozhatsky's avatar
      zram: switch to crypto compress API · ebaf9ab5
      Sergey Senozhatsky authored
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes possibilities of read paths preempting writes at wrong places
      (which could badly affect the performance of both paths) and at the same
      time opens the door for a move from custom LZO/LZ4 compression backends
      implementation to a more generic one, using crypto compress API.
      
      Joonsoo Kim [1] attempted to do this a while ago, but was faced with the
      need to introduce a new crypto API interface.  The root cause was the
      fact that crypto API compression algorithms require a compression stream
      structure (in zram terminology) for both compression and decompression
      ops, while in reality only several of the compression algorithms really
      need it.  This resulted in a concept of context-less crypto API
      compression backends [2].  Both write and read paths, though, would have
      been executed with preemption enabled, which could have resulted in
      decreased worst-case performance.  Consider, for example, the following
      case:
      
      	CPU0
      
      	zram_write()
      	  spin_lock()
      	    take the last idle stream
      	  spin_unlock()
      
      	<< preempted >>
      
      		zram_read()
      		  spin_lock()
      		    no idle streams
      		  spin_unlock()
      		  schedule()
      
      	resuming zram_write compression()
      
      but it took me some time to realize that, and it took even longer to
      evolve zram and to make it ready for the crypto API.  The key turned out
      to be dropping the idle streams list entirely.  Without the idle streams
      list we are free to use compression algorithms that require a compression
      stream for decompression (read), because streams are now placed in
      per-cpu data and each write path has to disable preemption for the
      compression op, almost completely eliminating the aforementioned case
      (technically, we still have a small chance, because the write path has a
      fast path and a slow path, and the slow path is executed with preemption
      enabled; but the frequency of a failed fast path is too low).
      
      TEST
      ====
      
      - 4 CPUs, x86_64 system
      - 3G zram, lzo
      - fio tests: read, randread, write, randwrite, rw, randrw
      
      test script [3] command:
       ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh
      
                         BASE           PATCHED
      jobs1
      READ:           2527.2MB/s	 2482.7MB/s
      READ:           2102.7MB/s	 2045.0MB/s
      WRITE:          1284.3MB/s	 1324.3MB/s
      WRITE:          1080.7MB/s	 1101.9MB/s
      READ:           430125KB/s	 437498KB/s
      WRITE:          430538KB/s	 437919KB/s
      READ:           399593KB/s	 403987KB/s
      WRITE:          399910KB/s	 404308KB/s
      jobs2
      READ:           8133.5MB/s	 7854.8MB/s
      READ:           7086.6MB/s	 6912.8MB/s
      WRITE:          3177.2MB/s	 3298.3MB/s
      WRITE:          2810.2MB/s	 2871.4MB/s
      READ:           1017.6MB/s	 1023.4MB/s
      WRITE:          1018.2MB/s	 1023.1MB/s
      READ:           977836KB/s	 984205KB/s
      WRITE:          979435KB/s	 985814KB/s
      jobs3
      READ:           13557MB/s	 13391MB/s
      READ:           11876MB/s	 11752MB/s
      WRITE:          4641.5MB/s	 4682.1MB/s
      WRITE:          4164.9MB/s	 4179.3MB/s
      READ:           1453.8MB/s	 1455.1MB/s
      WRITE:          1455.1MB/s	 1458.2MB/s
      READ:           1387.7MB/s	 1395.7MB/s
      WRITE:          1386.1MB/s	 1394.9MB/s
      jobs4
      READ:           20271MB/s	 20078MB/s
      READ:           18033MB/s	 17928MB/s
      WRITE:          6176.8MB/s	 6180.5MB/s
      WRITE:          5686.3MB/s	 5705.3MB/s
      READ:           2009.4MB/s	 2006.7MB/s
      WRITE:          2007.5MB/s	 2004.9MB/s
      READ:           1929.7MB/s	 1935.6MB/s
      WRITE:          1926.8MB/s	 1932.6MB/s
      jobs5
      READ:           18823MB/s	 19024MB/s
      READ:           18968MB/s	 19071MB/s
      WRITE:          6191.6MB/s	 6372.1MB/s
      WRITE:          5818.7MB/s	 5787.1MB/s
      READ:           2011.7MB/s	 1981.3MB/s
      WRITE:          2011.4MB/s	 1980.1MB/s
      READ:           1949.3MB/s	 1935.7MB/s
      WRITE:          1940.4MB/s	 1926.1MB/s
      jobs6
      READ:           21870MB/s	 21715MB/s
      READ:           19957MB/s	 19879MB/s
      WRITE:          6528.4MB/s	 6537.6MB/s
      WRITE:          6098.9MB/s	 6073.6MB/s
      READ:           2048.6MB/s	 2049.9MB/s
      WRITE:          2041.7MB/s	 2042.9MB/s
      READ:           2013.4MB/s	 1990.4MB/s
      WRITE:          2009.4MB/s	 1986.5MB/s
      jobs7
      READ:           21359MB/s	 21124MB/s
      READ:           19746MB/s	 19293MB/s
      WRITE:          6660.4MB/s	 6518.8MB/s
      WRITE:          6211.6MB/s	 6193.1MB/s
      READ:           2089.7MB/s	 2080.6MB/s
      WRITE:          2085.8MB/s	 2076.5MB/s
      READ:           2041.2MB/s	 2052.5MB/s
      WRITE:          2037.5MB/s	 2048.8MB/s
      jobs8
      READ:           20477MB/s	 19974MB/s
      READ:           18922MB/s	 18576MB/s
      WRITE:          6851.9MB/s	 6788.3MB/s
      WRITE:          6407.7MB/s	 6347.5MB/s
      READ:           2134.8MB/s	 2136.1MB/s
      WRITE:          2132.8MB/s	 2134.4MB/s
      READ:           2074.2MB/s	 2069.6MB/s
      WRITE:          2087.3MB/s	 2082.4MB/s
      jobs9
      READ:           19797MB/s	 19994MB/s
      READ:           18806MB/s	 18581MB/s
      WRITE:          6878.7MB/s	 6822.7MB/s
      WRITE:          6456.8MB/s	 6447.2MB/s
      READ:           2141.1MB/s	 2154.7MB/s
      WRITE:          2144.4MB/s	 2157.3MB/s
      READ:           2084.1MB/s	 2085.1MB/s
      WRITE:          2091.5MB/s	 2092.5MB/s
      jobs10
      READ:           19794MB/s	 19784MB/s
      READ:           18794MB/s	 18745MB/s
      WRITE:          6984.4MB/s	 6676.3MB/s
      WRITE:          6532.3MB/s	 6342.7MB/s
      READ:           2150.6MB/s	 2155.4MB/s
      WRITE:          2156.8MB/s	 2161.5MB/s
      READ:           2106.4MB/s	 2095.6MB/s
      WRITE:          2109.7MB/s	 2098.4MB/s
      
                                          BASE                       PATCHED
      jobs1                              perfstat
      stalled-cycles-frontend     102,480,595,419 (  41.53%)	  114,508,864,804 (  46.92%)
      stalled-cycles-backend       51,941,417,832 (  21.05%)	   46,836,112,388 (  19.19%)
      instructions                283,612,054,215 (    1.15)	  283,918,134,959 (    1.16)
      branches                     56,372,560,385 ( 724.923)	   56,449,814,753 ( 733.766)
      branch-misses                   374,826,000 (   0.66%)	      326,935,859 (   0.58%)
      jobs2                              perfstat
      stalled-cycles-frontend     155,142,745,777 (  40.99%)	  164,170,979,198 (  43.82%)
      stalled-cycles-backend       70,813,866,387 (  18.71%)	   66,456,858,165 (  17.74%)
      instructions                463,436,648,173 (    1.22)	  464,221,890,191 (    1.24)
      branches                     91,088,733,902 ( 760.088)	   91,278,144,546 ( 769.133)
      branch-misses                   504,460,363 (   0.55%)	      394,033,842 (   0.43%)
      jobs3                              perfstat
      stalled-cycles-frontend     201,300,397,212 (  39.84%)	  223,969,902,257 (  44.44%)
      stalled-cycles-backend       87,712,593,974 (  17.36%)	   81,618,888,712 (  16.19%)
      instructions                642,869,545,023 (    1.27)	  644,677,354,132 (    1.28)
      branches                    125,724,560,594 ( 690.682)	  126,133,159,521 ( 694.542)
      branch-misses                   527,941,798 (   0.42%)	      444,782,220 (   0.35%)
      jobs4                              perfstat
      stalled-cycles-frontend     246,701,197,429 (  38.12%)	  280,076,030,886 (  43.29%)
      stalled-cycles-backend      119,050,341,112 (  18.40%)	  110,955,641,671 (  17.15%)
      instructions                822,716,962,127 (    1.27)	  825,536,969,320 (    1.28)
      branches                    160,590,028,545 ( 688.614)	  161,152,996,915 ( 691.068)
      branch-misses                   650,295,287 (   0.40%)	      550,229,113 (   0.34%)
      jobs5                              perfstat
      stalled-cycles-frontend     298,958,462,516 (  38.30%)	  344,852,200,358 (  44.16%)
      stalled-cycles-backend      137,558,742,122 (  17.62%)	  129,465,067,102 (  16.58%)
      instructions              1,005,714,688,752 (    1.29)	1,007,657,999,432 (    1.29)
      branches                    195,988,773,962 ( 697.730)	  196,446,873,984 ( 700.319)
      branch-misses                   695,818,940 (   0.36%)	      624,823,263 (   0.32%)
      jobs6                              perfstat
      stalled-cycles-frontend     334,497,602,856 (  36.71%)	  387,590,419,779 (  42.38%)
      stalled-cycles-backend      163,539,365,335 (  17.95%)	  152,640,193,639 (  16.69%)
      instructions              1,184,738,177,851 (    1.30)	1,187,396,281,677 (    1.30)
      branches                    230,592,915,640 ( 702.902)	  231,253,802,882 ( 702.356)
      branch-misses                   747,934,786 (   0.32%)	      643,902,424 (   0.28%)
      jobs7                              perfstat
      stalled-cycles-frontend     396,724,684,187 (  37.71%)	  460,705,858,952 (  43.84%)
      stalled-cycles-backend      188,096,616,496 (  17.88%)	  175,785,787,036 (  16.73%)
      instructions              1,364,041,136,608 (    1.30)	1,366,689,075,112 (    1.30)
      branches                    265,253,096,936 ( 700.078)	  265,890,524,883 ( 702.839)
      branch-misses                   784,991,589 (   0.30%)	      729,196,689 (   0.27%)
      jobs8                              perfstat
      stalled-cycles-frontend     440,248,299,870 (  36.92%)	  509,554,793,816 (  42.46%)
      stalled-cycles-backend      222,575,930,616 (  18.67%)	  213,401,248,432 (  17.78%)
      instructions              1,542,262,045,114 (    1.29)	1,545,233,932,257 (    1.29)
      branches                    299,775,178,439 ( 697.666)	  300,528,458,505 ( 694.769)
      branch-misses                   847,496,084 (   0.28%)	      748,794,308 (   0.25%)
      jobs9                              perfstat
      stalled-cycles-frontend     506,269,882,480 (  37.86%)	  592,798,032,820 (  44.43%)
      stalled-cycles-backend      253,192,498,861 (  18.93%)	  233,727,666,185 (  17.52%)
      instructions              1,721,985,080,913 (    1.29)	1,724,666,236,005 (    1.29)
      branches                    334,517,360,255 ( 694.134)	  335,199,758,164 ( 697.131)
      branch-misses                   873,496,730 (   0.26%)	      815,379,236 (   0.24%)
      jobs10                             perfstat
      stalled-cycles-frontend     549,063,363,749 (  37.18%)	  651,302,376,662 (  43.61%)
      stalled-cycles-backend      281,680,986,810 (  19.07%)	  277,005,235,582 (  18.55%)
      instructions              1,901,859,271,180 (    1.29)	1,906,311,064,230 (    1.28)
      branches                    369,398,536,153 ( 694.004)	  370,527,696,358 ( 688.409)
      branch-misses                   967,929,335 (   0.26%)	      890,125,056 (   0.24%)
      
                                  BASE           PATCHED
      seconds elapsed        79.421641008	78.735285546
      seconds elapsed        61.471246133	60.869085949
      seconds elapsed        62.317058173	62.224188495
      seconds elapsed        60.030739363	60.081102518
      seconds elapsed        74.070398362	74.317582865
      seconds elapsed        84.985953007	85.414364176
      seconds elapsed        97.724553255	98.173311344
      seconds elapsed        109.488066758	110.268399318
      seconds elapsed        122.768189405	122.967164498
      seconds elapsed        135.130035105	136.934770801
      
      On my other system (8 x86_64 CPUs, short version of test results):
      
                                  BASE           PATCHED
      seconds elapsed        19.518065994	19.806320662
      seconds elapsed        15.172772749	15.594718291
      seconds elapsed        13.820925970	13.821708564
      seconds elapsed        13.293097816	14.585206405
      seconds elapsed        16.207284118	16.064431606
      seconds elapsed        17.958376158	17.771825767
      seconds elapsed        19.478009164	19.602961508
      seconds elapsed        21.347152811	21.352318709
      seconds elapsed        24.478121126	24.171088735
      seconds elapsed        26.865057442	26.767327618
      
      So performance-wise the numbers are quite similar.
      
      Also update zcomp interface to be more aligned with the crypto API.
      
      [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
      [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
      [3] https://github.com/sergey-senozhatsky/zram-perf-test
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebaf9ab5
    • Sergey Senozhatsky's avatar
      zram: rename zstrm find-release functions · 2aea8493
      Sergey Senozhatsky authored
      This started as an 'add zlib support' work, but after some thinking I
      saw no blockers for a bigger change -- a switch to crypto API.
      
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes possibilities of read paths preempting writes at wrong places
      and opens the door for a move from custom LZO/LZ4 compression backends
      implementation to a more generic one, using crypto compress API.
      
      This patch set also eliminates the need of a new context-less crypto API
      interface, which was quite hard to sell, so we can move along faster.
      
      benchmarks:
      
      (x86_64, 4GB, zram-perf script)
      
      Run time of fio as reported by perf (max jobs=3).  I performed the fio
      test with an increasing number of parallel jobs (up to 3) on a 3G zram
      device, using `static' data and the following crypto comp algorithms:
      
      	842, deflate, lz4, lz4hc, lzo
      
      The output was:
      
       - test running time (which tells us which algorithms perform faster)
      
      and
      
       - zram mm_stat (which tells the compressed memory size, max used memory, etc).
      
      It's just for information.  For example, LZ4HC has twice the running
      time of LZO, but the compressed memory size is 23592960 vs 34603008
      bytes.
      
        test-fio-zram-842
           197.907655282 seconds time elapsed
           201.623142884 seconds time elapsed
           226.854291345 seconds time elapsed
        test-fio-zram-DEFLATE
           253.259516155 seconds time elapsed
           258.148563401 seconds time elapsed
           290.251909365 seconds time elapsed
        test-fio-zram-LZ4
            27.022598717 seconds time elapsed
            29.580522717 seconds time elapsed
            33.293463430 seconds time elapsed
        test-fio-zram-LZ4HC
            56.393954615 seconds time elapsed
            74.904659747 seconds time elapsed
           101.940998564 seconds time elapsed
        test-fio-zram-LZO
            28.155948075 seconds time elapsed
            30.390036330 seconds time elapsed
            34.455773159 seconds time elapsed
      
      zram mm_stat-s (max fio jobs=3)
      
        test-fio-zram-842
        mm_stat (jobs1): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs2): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs3): 3221225472 673185792 690266112        0 690266112        0        0
        test-fio-zram-DEFLATE
        mm_stat (jobs1): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  24379392  37761024        0  37761024        0        0
        test-fio-zram-LZ4
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZ4HC
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZO
        mm_stat (jobs1): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs2): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs3): 3221225472  34603008  50335744        0  50339840        0        0
      
      This patch (of 8):
      
      We don't perform any zstream idle list lookup anymore, so
      zcomp_strm_find()/zcomp_strm_release() names are not representative.
      
      Rename to zcomp_stream_get()/zcomp_stream_put().
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2aea8493
    • Aneesh Kumar K.V's avatar
      powerpc/mm: check for irq disabled() only if DEBUG_VM is enabled · 9af3f56b
      Aneesh Kumar K.V authored
      We don't need to check this always.  The idea here is to capture the
      wrong usage of find_linux_pte_or_hugepte and we can do that by
      occasionally running with DEBUG_VM enabled.
      
      Link: http://lkml.kernel.org/r/1464692688-6612-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9af3f56b
    • Aneesh Kumar K.V's avatar
      include/linux/mmdebug.h: add VM_WARN which maps to WARN() · a54f9aeb
      Aneesh Kumar K.V authored
      This enables us to do VM_WARN(condition, "warn message");
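
A userspace sketch of the idea, assuming a debug-only warning macro that forwards to a general warning macro when a debug switch is set. All `TOY_*` names are hypothetical, and `TOY_WARN` only imitates the shape of the kernel's WARN() by printing to stderr; the real definitions live in the kernel headers.

```c
#include <stdio.h>

/* Userspace stand-in for WARN(): print when cond is true, yield 0 or 1. */
#define TOY_WARN(cond, ...) \
	((cond) ? (fprintf(stderr, __VA_ARGS__), 1) : 0)

/* Debug flavour: forwards straight to the warning macro. */
#define TOY_VM_WARN_DEBUG(cond, ...)	TOY_WARN(cond, __VA_ARGS__)

/*
 * Non-debug flavour: sizeof still type-checks cond but never
 * evaluates it, and the message arguments are dropped.
 */
#define TOY_VM_WARN_NODEBUG(cond, ...)	((void)sizeof((long)(cond)), 0)

/* Pick a flavour at build time, e.g. with -DTOY_DEBUG_VM. */
#ifdef TOY_DEBUG_VM
#define TOY_VM_WARN(cond, ...)	TOY_VM_WARN_DEBUG(cond, __VA_ARGS__)
#else
#define TOY_VM_WARN(cond, ...)	TOY_VM_WARN_NODEBUG(cond, __VA_ARGS__)
#endif
```

A call such as `TOY_VM_WARN(count < 0, "bad count %d\n", count)` then costs nothing in a non-debug build, while the condition must still compile.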
      
      Link: http://lkml.kernel.org/r/1464692688-6612-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a54f9aeb
    • Vladimir Davydov's avatar
      mm: oom: add memcg to oom_control · 2a966b77
      Vladimir Davydov authored
      It's a part of oom context just like allocation order and nodemask, so
      let's move it to oom_control instead of passing it in the argument list.
      
      Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a966b77
    • Vladimir Davydov's avatar
      mm: zap ZONE_OOM_LOCKED · 798fd756
      Vladimir Davydov authored
      Not used since oom_lock was introduced.
      
      Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      798fd756
    • Reza Arbab's avatar
      memory-hotplug: use zone_can_shift() for sysfs valid_zones attribute · a371d9f1
      Reza Arbab authored
      Since zone_can_shift() is being used to validate the target zone during
      onlining, it should also be used to determine the content of
      valid_zones.
      
      Link: http://lkml.kernel.org/r/1462816419-4479-4-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a371d9f1
    • Reza Arbab's avatar
      memory-hotplug: more general validation of zone during online · df429ac0
      Reza Arbab authored
      When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
      ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.
      
      To be more flexible, use the following criteria instead; to online
      memory from zone X into zone Y,
      
      * Any zones between X and Y must be unused.
      * If X is lower than Y, the onlined memory must lie at the end of X.
      * If X is higher than Y, the onlined memory must lie at the start of X.
      
      Add zone_can_shift() to make this determination.
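
      The three criteria above can be sketched in userspace C as follows.
      This is only an illustration, not the kernel's zone_can_shift(): the
      struct sketch_zone type, the sketch_can_shift() name and its
      signature are hypothetical, zones are modelled as half-open PFN
      ranges, and treating X == Y as trivially allowed is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone: the PFN range [start_pfn, end_pfn). */
struct sketch_zone {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive; equal to start_pfn when unused */
};

static bool sketch_zone_unused(const struct sketch_zone *z)
{
	return z->start_pfn == z->end_pfn;
}

/*
 * Can [pfn, pfn + nr_pages), which currently sits in zones[from] (zone X),
 * be onlined into zones[to] (zone Y)?
 */
static bool sketch_can_shift(const struct sketch_zone *zones, size_t nr_zones,
			     size_t from, size_t to,
			     unsigned long pfn, unsigned long nr_pages)
{
	size_t lo = from < to ? from : to;
	size_t hi = from < to ? to : from;
	size_t i;

	assert(from < nr_zones && to < nr_zones);

	/* The range has to lie inside X in the first place. */
	if (pfn < zones[from].start_pfn ||
	    pfn + nr_pages > zones[from].end_pfn)
		return false;

	if (from == to)
		return true;	/* assumption: nothing to shift */

	/* Any zones strictly between X and Y must be unused. */
	for (i = lo + 1; i < hi; i++)
		if (!sketch_zone_unused(&zones[i]))
			return false;

	if (from < to)	/* X lower than Y: range must be the end of X */
		return pfn + nr_pages == zones[from].end_pfn;

	/* X higher than Y: range must be the start of X */
	return pfn == zones[from].start_pfn;
}
```

      With zones {[0,100), unused, [100,200)}, shifting PFNs 90..99 from
      the first zone up to the third passes all three checks, while
      shifting PFNs 50..59 does not because they are not at the end of X.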
      
      Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df429ac0
    • Reza Arbab's avatar
      memory-hotplug: add move_pfn_range() · e51e6c8f
      Reza Arbab authored
      Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
      move_pfn_range_right().
      
      No functional change. This will be utilized by a later patch.
      
      Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e51e6c8f
    • Oliver O'Halloran's avatar
      mm/init: fix zone boundary creation · 90cae1fe
      Oliver O'Halloran authored
      As a part of memory initialisation the architecture passes an array to
      free_area_init_nodes() which specifies the max PFN of each memory zone.
      This array is not necessarily monotonic (due to unused zones) so this
      array is parsed to build monotonic lists of the min and max PFN for each
      zone.  ZONE_MOVABLE is special cased here as its limits are managed by
      the mm subsystem rather than the architecture.  Unfortunately, this
      special casing is broken when ZONE_MOVABLE is not the last zone in
      the zone list.  The core of the issue is:
      
      	if (i == ZONE_MOVABLE)
      		continue;
      	arch_zone_lowest_possible_pfn[i] =
      		arch_zone_highest_possible_pfn[i-1];
      
      As ZONE_MOVABLE is skipped, the lowest_possible_pfn of the next zone will
      be set to zero.  This patch fixes this bug by explicitly tracking
      where the next zone should start rather than relying on the contents of
      arch_zone_highest_possible_pfn[].
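
      The idea of tracking the next zone start explicitly can be sketched
      in userspace C as below.  This is an illustration under assumptions,
      not the patch itself: the enum, the array names and
      sketch_zone_limits() are hypothetical, and the zone list is made up
      so that one zone follows the movable zone.

```c
#include <assert.h>

/* Hypothetical zone list in which the movable zone is not the last one. */
enum sketch_zone_type {
	SKETCH_ZONE_DMA,
	SKETCH_ZONE_NORMAL,
	SKETCH_ZONE_MOVABLE,
	SKETCH_ZONE_DEVICE,
	SKETCH_MAX_NR_ZONES
};

static unsigned long lowest_pfn[SKETCH_MAX_NR_ZONES];
static unsigned long highest_pfn[SKETCH_MAX_NR_ZONES];

/*
 * Build monotonic per-zone limits from max_zone_pfn[].  The start of the
 * next zone is carried in start_pfn instead of being read back from
 * highest_pfn[i - 1], so skipping the movable zone cannot reset it.
 */
static void sketch_zone_limits(const unsigned long *max_zone_pfn,
			       unsigned long min_pfn)
{
	unsigned long start_pfn = min_pfn;
	unsigned long end_pfn;
	int i;

	assert(max_zone_pfn);

	for (i = 0; i < SKETCH_MAX_NR_ZONES; i++) {
		if (i == SKETCH_ZONE_MOVABLE) {
			/* Limits of the movable zone are managed elsewhere. */
			lowest_pfn[i] = 0;
			highest_pfn[i] = 0;
			continue;
		}
		/* An unused zone collapses to the empty range at start_pfn. */
		end_pfn = max_zone_pfn[i] > start_pfn ?
			  max_zone_pfn[i] : start_pfn;
		lowest_pfn[i] = start_pfn;
		highest_pfn[i] = end_pfn;
		start_pfn = end_pfn;
	}
}
```

      With max_zone_pfn = {16, 1000, 0, 2000} and min_pfn = 1, the zone
      after the movable one starts at PFN 1000 rather than at zero.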
      
      This is low priority.  To get bitten by this you need to enable a zone
      that appears after ZONE_MOVABLE in the zone_type enum.  As far as I can
      tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
      so I can't see this affecting too many people.
      
      I only noticed this because I've been fiddling with ZONE_DEVICE on
      powerpc and 4.6 broke my test kernel.  This bug, in conjunction with the
      changes in Taku Izumi's kernelcore=mirror patch (d91749c1) and
      powerpc being the odd architecture which initialises max_zone_pfn[] to
      ~0ul instead of 0 caused all of system memory to be placed into
      ZONE_DEVICE at boot, followed by a panic since device memory cannot be used
      for kernel allocations.  I've already submitted a patch to fix the
      powerpc specific bits, but I figured this should be fixed too.
      
      Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90cae1fe
    • Li RongQing's avatar
      mm/memcontrol.c: remove the useless parameter for mc_handle_swap_pte · 48406ef8
      Li RongQing authored
      It seems like this parameter has never been used since being introduced
      by 90254a65 ("memcg: clean up move charge").  Not a big deal because
      I assume the function would get inlined into the caller anyway but why
      not get rid of it.
      
      [mhocko@suse.com: wrote changelog]
        Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48406ef8
    • Wei Yongjun's avatar
      mm/slab: use list_move instead of list_del/list_add · de24baec
      Wei Yongjun authored
      Using list_move() instead of list_del() + list_add() to avoid needlessly
      poisoning the next and prev values.
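
      The difference can be illustrated with a small userspace list.  This
      is a sketch modelled loosely on the kernel list helpers, not a copy
      of them: all sk_* names and the poison values are hypothetical.

```c
#include <assert.h>

struct sk_list {
	struct sk_list *next, *prev;
};

/* Hypothetical poison values marking an entry as "not on any list". */
#define SK_POISON1 ((struct sk_list *)0x100)
#define SK_POISON2 ((struct sk_list *)0x200)

static void sk_init(struct sk_list *head)
{
	head->next = head;
	head->prev = head;
}

/* Unlink e from its neighbours without touching e's own pointers. */
static void sk_unlink(struct sk_list *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

/* Insert e right after head. */
static void sk_add(struct sk_list *e, struct sk_list *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Delete: unlink, then poison.  The two stores are wasted if e is re-added. */
static void sk_del(struct sk_list *e)
{
	sk_unlink(e);
	e->next = SK_POISON1;
	e->prev = SK_POISON2;
}

/* Move: unlink and re-add, skipping the poisoning step entirely. */
static void sk_move(struct sk_list *e, struct sk_list *head)
{
	assert(e && head);
	sk_unlink(e);
	sk_add(e, head);
}
```

      Calling sk_del() followed by sk_add() ends in the same list state as
      sk_move(), but writes the poison values only to overwrite them again.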
      
      Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de24baec
    • Alexey Dobriyan's avatar
      mm: faster kmalloc_array(), kcalloc() · 91c6a05f
      Alexey Dobriyan authored
      When both arguments to kmalloc_array() or kcalloc() are known at compile
      time, their product is also known at compile time, yet the search for
      the kmalloc cache still happens at runtime instead of at compile time.
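
      One way to express "pick the fast path when both arguments are
      constants" is sketched below in userspace C.  It is an illustration
      only: the sketch_* names are hypothetical, malloc() stands in for the
      kmalloc family, and __builtin_constant_p() is a GCC/Clang builtin
      whose result for inlined arguments depends on the compiler and the
      optimization level.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins: in this sketch both paths simply call malloc(). */
static void *sketch_alloc_constant_size(size_t bytes)
{
	return malloc(bytes);	/* size class could be resolved at build time */
}

static void *sketch_alloc_runtime_size(size_t bytes)
{
	return malloc(bytes);	/* size class has to be looked up at runtime */
}

static inline void *sketch_alloc_array(size_t n, size_t size)
{
	/* Refuse requests whose byte count would overflow size_t. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;

	/* Past this point n * size cannot wrap around. */
	assert(size == 0 || (n * size) / size == n);

	if (__builtin_constant_p(n) && __builtin_constant_p(size))
		return sketch_alloc_constant_size(n * size);
	return sketch_alloc_runtime_size(n * size);
}
```

      A call such as sketch_alloc_array(4, 8) may take the constant branch
      once the function is inlined in an optimized build; an unoptimized
      build is free to take the runtime branch, and both return the same
      kind of buffer.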
      
      Link: http://lkml.kernel.org/r/20160627213454.GA2440@p183.telecom.by
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91c6a05f
    • Michal Hocko's avatar
      slab: do not panic on invalid gfp_mask · 72baeef0
      Michal Hocko authored
      Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
      This is a rather harsh way to announce a non-critical issue.  The
      allocator is free to ignore invalid flags.  Let's simply replace BUG()
      with dump_stack() to identify the offender, and fix up the mask to move
      on with the allocation request.
      
      This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
      module:
      
        Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
        CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
          dump_stack+0x67/0x90
          cache_alloc_refill+0x201/0x617
          kmem_cache_alloc_trace+0xa7/0x24a
          ? 0xffffffffa0005000
          mymodule_init+0x20/0x1000 [test_slab]
          do_one_initcall+0xe7/0x16c
          ? rcu_read_lock_sched_held+0x61/0x69
          ? kmem_cache_alloc_trace+0x197/0x24a
          do_init_module+0x5f/0x1d9
          load_module+0x1a3d/0x1f21
          ? retint_kernel+0x2d/0x2d
          SyS_init_module+0xe8/0x10e
          ? SyS_init_module+0xe8/0x10e
          do_syscall_64+0x68/0x13f
          entry_SYSCALL64_slow_path+0x25/0x25
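
      The warn-and-fix-up behaviour can be sketched in userspace C as
      follows.  The two flag values are the ones printed in the log above;
      everything else is an assumption made for illustration: the sketch_*
      names are hypothetical and the mask of invalid flags is reduced to
      the single flag from this example.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned int sketch_gfp_t;

/* Values as printed in the log above. */
#define SKETCH_GFP_HIGHMEM	0x2u
#define SKETCH_GFP_KERNEL	0x24000c0u

/* Assumption: only the flag from the example counts as invalid here. */
#define SKETCH_BUG_MASK		SKETCH_GFP_HIGHMEM

/* Report invalid flags, strip them, and let the allocation continue. */
static sketch_gfp_t sketch_fixup_gfp(sketch_gfp_t flags)
{
	sketch_gfp_t invalid = flags & SKETCH_BUG_MASK;

	if (invalid) {
		fprintf(stderr,
			"Unexpected gfp: %#x. Fixing up to gfp: %#x. Fix your code!\n",
			invalid, flags & ~SKETCH_BUG_MASK);
		flags &= ~SKETCH_BUG_MASK;
	}

	/* Whatever came in, no invalid bit may survive the fix-up. */
	assert((flags & SKETCH_BUG_MASK) == 0);
	return flags;
}
```

      Passing SKETCH_GFP_KERNEL | SKETCH_GFP_HIGHMEM prints one warning and
      returns SKETCH_GFP_KERNEL; a valid mask passes through unchanged.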
      
      Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72baeef0
    • Michal Hocko's avatar
      slab: make GFP_SLAB_BUG_MASK information more human readable · bacdcb34
      Michal Hocko authored
      printk has offered %pGg for quite some time, so let's use it to get a
      human-readable list of invalid flags.
      
      The original output would be
        [  429.191962] gfp: 2
      
      after the change
        [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)
      
      Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bacdcb34
    • Thomas Garnier's avatar
      mm: SLUB freelist randomization · 210e7a43
      Thomas Garnier authored
      Implements freelist randomization for the SLUB allocator.  It was
      previously implemented for the SLAB allocator.  Both use the same
      configuration option (CONFIG_SLAB_FREELIST_RANDOM).
      
      The list is randomized during initialization of a new set of pages.  The
      order on different freelist sizes is pre-computed at boot for
      performance.  Each kmem_cache has its own randomized freelist.
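
      Pre-computing a random order for a freelist of a given size can be
      sketched with a Fisher-Yates shuffle, as below.  This is a userspace
      illustration under assumptions, not the kernel code: the sketch_*
      names are hypothetical, and the xorshift32 generator is only a
      placeholder for whatever entropy source the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tiny PRNG (xorshift32); a placeholder, not kernel entropy. */
static uint32_t sketch_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/* Fill seq[0..count) with a random permutation of 0..count-1. */
static void sketch_freelist_randomize(unsigned int *seq, size_t count,
				      uint32_t seed)
{
	uint32_t state = seed;
	unsigned int tmp;
	size_t i, j;

	assert(seed != 0);	/* xorshift32 never leaves the zero state */

	for (i = 0; i < count; i++)
		seq[i] = (unsigned int)i;

	/* Fisher-Yates: swap the last unplaced slot with a random earlier one. */
	for (i = count; i > 1; i--) {
		j = sketch_rand(&state) % i;	/* modulo bias ignored here */
		tmp = seq[i - 1];
		seq[i - 1] = seq[j];
		seq[j] = tmp;
	}
}
```

      An allocator could then hand out object seq[0], seq[1], and so on
      from a fresh set of pages instead of walking it sequentially.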
      
      This security feature reduces the predictability of the kernel SLUB
      allocator against heap overflows rendering attacks much less stable.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      Performance results:
      
      slab_test impact is between 3% and 4% on average for 100000 attempts
      without SMP.  This is a very focused test; kernbench shows that the
      overall impact on the system is much lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      210e7a43
    • Thomas Garnier's avatar
      mm: reorganize SLAB freelist randomization · 7c00fce9
      Thomas Garnier authored
      The kernel heap allocators use a sequential freelist, which makes their
      allocations predictable.  This predictability makes kernel heap overflows
      easier to exploit.  An attacker can carefully prepare the kernel heap to
      control the chunk that follows the overflowed one.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      ***Problems that needed solving:
       - Randomize the Freelist (singly linked) used in the SLUB allocator.
       - Ensure good performance to encourage usage.
       - Get best entropy in early boot stage.
      
      ***Parts:
       - 01/02 Reorganize the SLAB Freelist randomization to share elements
         with the SLUB implementation.
       - 02/02 The SLUB Freelist randomization implementation. Similar approach
         to the SLAB one, but tailored to the singly linked freelist used in SLUB.
      
      ***Performance data:
      
      slab_test impact is between 3% and 4% on average for 100000 attempts
      without SMP.  This is a very focused test; kernbench shows that the
      overall impact on the system is much lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      This patch (of 2):
      
      This commit reorganizes the previous SLAB freelist randomization to
      prepare for the SLUB implementation.  It moves functions that will be
      shared to slab_common.
      
      The entropy functions are changed to align with the SLUB implementation,
      now using get_random_(int|long) functions.  These functions were chosen
      because they provide a bit more entropy early on boot and better
      performance when specific arch instructions are not available.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c00fce9
    • Brian Foster's avatar
      fs/fs-writeback.c: inode writeback list tracking tracepoints · 9a46b04f
      Brian Foster authored
      The per-sb inode writeback list tracks inodes currently under writeback
      to facilitate efficient sync processing.  In particular, it ensures that
      sync only needs to walk through a list of inodes that were cleaned by
      the sync.
      
      Add a couple of tracepoints to help identify when inodes are added to
      and removed from the writeback lists.  Piggyback off of the writeback
      lazytime tracepoint template as it already tracks the relevant inode
      information.
      
      Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a46b04f
    • Dave Chinner's avatar
      fs/fs-writeback.c: add a new writeback list for sync · 6c60d2b5
      Dave Chinner authored
      wait_sb_inodes() currently does a walk of all inodes in the filesystem
      to find dirty ones to wait on during sync.  This is highly inefficient
      and wastes a lot of CPU when there are lots of clean cached inodes that
      we don't need to wait on.
      
      To avoid this "all inode" walk, we need to track inodes that are
      currently under writeback that we need to wait for.  We do this by
      adding inodes to a writeback list on the sb when the mapping is first
      tagged as having pages under writeback.  wait_sb_inodes() can then walk
      this list of "inodes under IO" and wait specifically just for the inodes
      that the current sync(2) needs to wait for.
      
      Define a couple helpers to add/remove an inode from the writeback list
      and call them when the overall mapping is tagged for or cleared from
      writeback.  Update wait_sb_inodes() to walk only the inodes under
      writeback due to the sync.
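
      The bookkeeping described above can be sketched in userspace C as
      follows.  It is a simplified model under assumptions, not the kernel
      implementation: the sketch_* types and helpers are hypothetical, a
      per-inode page counter stands in for the mapping's writeback tag,
      and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_inode {
	int pages_under_wb;		/* pages currently under writeback */
	struct sketch_inode *wb_next;	/* links in the per-sb writeback list */
	struct sketch_inode *wb_prev;
};

struct sketch_sb {
	struct sketch_inode *wb_head;	/* inodes with writeback in flight */
};

/* A page starts writeback; list the inode when it is the first such page. */
static void sketch_wb_start_page(struct sketch_sb *sb, struct sketch_inode *inode)
{
	if (inode->pages_under_wb++ == 0) {
		inode->wb_prev = NULL;
		inode->wb_next = sb->wb_head;
		if (sb->wb_head)
			sb->wb_head->wb_prev = inode;
		sb->wb_head = inode;
	}
}

/* A page finishes writeback; unlist the inode when it was the last one. */
static void sketch_wb_end_page(struct sketch_sb *sb, struct sketch_inode *inode)
{
	assert(inode->pages_under_wb > 0);

	if (--inode->pages_under_wb == 0) {
		if (inode->wb_prev)
			inode->wb_prev->wb_next = inode->wb_next;
		else
			sb->wb_head = inode->wb_next;
		if (inode->wb_next)
			inode->wb_next->wb_prev = inode->wb_prev;
		inode->wb_next = NULL;
		inode->wb_prev = NULL;
	}
}

/* What a sync has to visit: only the listed inodes, not every cached one. */
static size_t sketch_wb_count(const struct sketch_sb *sb)
{
	const struct sketch_inode *inode;
	size_t n = 0;

	for (inode = sb->wb_head; inode; inode = inode->wb_next)
		n++;
	return n;
}
```

      However many clean inodes are cached, the walk in sketch_wb_count()
      touches only those that still have pages under writeback.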
      
      With this change, filesystem sync times are significantly reduced for
      filesystems with heavily populated inode caches and no other work to
      do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
      with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
      than 0.1s when the filesystem is fully clean.
      
      Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c60d2b5
    • piaojun's avatar
      ocfs2/cluster: clean up unnecessary assignment for 'ret' · 7d65b274
      piaojun authored
      Clean up unnecessary assignment for 'ret'.
      
      Link: http://lkml.kernel.org/r/578C61F6.4080403@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d65b274
    • Joseph Qi's avatar
      ocfs2: remove obscure BUG_ON in dlmglue · e81f1c5c
      Joseph Qi authored
      These BUG_ON(!inode) checks are obscure because inode has already been
      used to get osb.  We can also guarantee that inode is valid in this
      context, so the checks can safely be removed.
      
      Link: http://lkml.kernel.org/r/5776336A.6030104@huawei.com
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e81f1c5c
    • Joseph Qi's avatar
      ocfs2: cleanup implemented prototypes · 698d44b4
      Joseph Qi authored
      Several prototypes in inode.h are only declared but never actually
      implemented or used, so remove them.
      
      Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      698d44b4
    • Joseph Qi's avatar
      ocfs2/dlm: fix memory leak of dlm_debug_ctxt · 8ec7b17a
      Joseph Qi authored
      dlm_debug_ctxt->debug_refcnt is initialized to 1 and then increased to 2
      by dlm_debug_get in dlm_debug_init.  But dlm_debug_put is called only
      once in dlm_debug_shutdown when unregistering dlm, so dlm_debug_ctxt
      is leaked.
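
      The imbalance can be reproduced with a minimal reference-counted
      object in userspace C.  This is only a model of the counting
      problem: the sketch_* names are hypothetical and do not correspond
      to the ocfs2 functions.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_ctxt {
	int refcnt;
};

static int sketch_live_objects;	/* objects allocated and not yet freed */

/* Allocate an object holding one initial reference. */
static struct sketch_ctxt *sketch_new(void)
{
	struct sketch_ctxt *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->refcnt = 1;
	sketch_live_objects++;
	return c;
}

static void sketch_get(struct sketch_ctxt *c)
{
	c->refcnt++;
}

/* Drop one reference; the object is freed only when the count hits zero. */
static void sketch_put(struct sketch_ctxt *c)
{
	assert(c->refcnt > 0);

	if (--c->refcnt == 0) {
		free(c);
		sketch_live_objects--;
	}
}
```

      After sketch_new() and one sketch_get() the count is 2, so a single
      sketch_put() leaves the object allocated; either the extra get has
      to go or a second put is needed.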
      
      Link: http://lkml.kernel.org/r/577BB755.4030900@huawei.com
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ec7b17a
    • Joseph Qi's avatar
      ocfs2: cleanup unneeded goto in ocfs2_create_new_inode_locks · a8f24f1b
      Joseph Qi authored
      The last goto is unneeded, so remove it.
      
      Link: http://lkml.kernel.org/r/576213D3.6080002@huawei.com
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8f24f1b
    • Junxiao Bi's avatar
      ocfs2: improve recovery performance · 0b492f68
      Junxiao Bi authored
      Journal replay will be run when performing recovery for a dead node.  To
      avoid the impact of a stale cache, all blocks of the dead node's journal
      inode were reloaded from disk.  This hurts performance.  Checking whether
      a block is cached before reloading it improves performance a lot.  In
      my test environment, the recovery time improved from 120s to 1s.
      
      [akpm@linux-foundation.org: clean up the for loop p_blkno handling]
      Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: "Gang He" <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b492f68
    • Eric Ren's avatar
      ocfs2: fix a redundant re-initialization · 191df2b5
      Eric Ren authored
      memset() has already zeroed the whole struct locking_max_version, so
      there is no need to zero its two fields individually.
      
      Link: http://lkml.kernel.org/r/1463970605-18354-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Randy Dunlap's avatar
      debugobjects.h: fix trivial kernel doc warning · 17359a80
      Randy Dunlap authored
      Add ':' to fix trivial kernel-doc warning in <linux/debugobjects.h>:
      
        ..//include/linux/debugobjects.h:63: warning: No description found for parameter 'is_static_object'
      
      Link: http://lkml.kernel.org/r/575B01B8.5060600@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Sudip Mukherjee's avatar
      m32r: add __ucmpdi2 to fix build failure · a44ce523
      Sudip Mukherjee authored
      The m32r build fails with the following errors:
      
        ERROR: "__ucmpdi2" [lib/842/842_decompress.ko] undefined!
        ERROR: "__ucmpdi2" [fs/btrfs/btrfs.ko] undefined!
        ERROR: "__ucmpdi2" [drivers/scsi/sd_mod.ko] undefined!
        ERROR: "__ucmpdi2" [drivers/media/i2c/adv7842.ko] undefined!
        ERROR: "__ucmpdi2" [drivers/md/bcache/bcache.ko] undefined!
        ERROR: "__ucmpdi2" [drivers/iio/imu/inv_mpu6050/inv-mpu6050.ko] undefined!
      
      Add __ucmpdi2 to the m32r architecture, following the example of other
      architectures such as h8300, microblaze and mips.
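
      For readers unfamiliar with the helper: __ucmpdi2 is the libgcc routine
      for unsigned 64-bit comparison, returning 0, 1 or 2 for less-than, equal
      or greater-than.  The following is a minimal userspace sketch of that
      contract, not the code added by this commit; the name sketch_ucmpdi2 and
      the explicit split into 32-bit halves are illustrative assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for __ucmpdi2 (the real symbol name is avoided so
 * this does not clash with libgcc).  Comparing 32-bit halves keeps the
 * compiler from needing a 64-bit compare helper in the first place.
 */
int sketch_ucmpdi2(uint64_t a, uint64_t b)
{
	uint32_t a_hi = (uint32_t)(a >> 32), a_lo = (uint32_t)a;
	uint32_t b_hi = (uint32_t)(b >> 32), b_lo = (uint32_t)b;

	/* The high words decide the result unless they are equal. */
	if (a_hi < b_hi)
		return 0;
	if (a_hi > b_hi)
		return 2;

	/* High words are equal: fall back to the low words. */
	if (a_lo < b_lo)
		return 0;
	if (a_lo > b_lo)
		return 2;

	return 1; /* a == b */
}
```

      For example, sketch_ucmpdi2(1, 2) yields 0 and sketch_ucmpdi2(2, 2)
      yields 1.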
      
      Link: http://lkml.kernel.org/r/1465509213-4280-1-git-send-email-sudipm.mukherjee@gmail.com
      Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Riku Voipio's avatar
      scripts/bloat-o-meter: fix percent on <1% changes · 8cde0daf
      Riku Voipio authored
      In Python 2, dividing two integers performs integer division unless at
      least one operand is a float.  As a result, the current bloat-o-meter
      fails to print sub-percent changes:
      
        Total: Before=10515408, After=10604060, chg 0.000000%
      
      Force float division by making one operand a float, and pretty-print the
      result to two decimal places:
      
        Total: Before=10515408, After=10604060, chg +0.84%
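
      The same truncation can be reproduced outside Python.  bloat-o-meter
      itself is a Python script, so the C sketch below is only an analogue of
      the problem and the fix, not the actual patch; the function names are
      hypothetical.  Making one operand a floating-point constant (100.0) is
      what forces floating-point division.

```c
#include <stdio.h>

/* Integer arithmetic: the quotient is truncated toward zero. */
long long change_pct_int(long long before, long long after)
{
	return (after - before) * 100 / before;
}

/* Using 100.0 promotes the whole expression to double. */
double change_pct_float(long long before, long long after)
{
	return (after - before) * 100.0 / before;
}

/* Format with an explicit sign and two decimals, e.g. "chg +0.84%". */
int format_change(char *buf, size_t len, long long before, long long after)
{
	return snprintf(buf, len, "chg %+.2f%%",
			change_pct_float(before, after));
}
```

      With the totals quoted above (Before=10515408, After=10604060), the
      integer version returns 0 while the formatted float version produces
      "chg +0.84%".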
      
      Link: http://lkml.kernel.org/r/1465980311-23814-1-git-send-email-riku.voipio@linaro.org
      Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Kees Cook's avatar
      kbuild: abort build on bad stack protector flag · c965b105
      Kees Cook authored
      Before, the stack protector flag was sanity checked before .config had
      been reprocessed.  This meant the build couldn't be aborted early, and
      only a warning could be emitted followed later by the compiler blowing
      up with an unknown flag.  This has caused a lot of confusion over time,
      so this splits the flag selection from sanity checking and performs the
      sanity checking after the make has been restarted from a reprocessed
      .config, so builds can be aborted as early as possible now.
      
      Additionally, this moves the x86-specific sanity check to the same location,
      since it suffered from the same warn-then-wait-for-compiler-failure
      problem.
      
      Link: http://lkml.kernel.org/r/20160712223043.GA11664@www.outflux.net
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Arnd Bergmann's avatar
      fbmon: remove unused function argument · 3bd96463
      Arnd Bergmann authored
      When building with "make W=1", we get a warning about an empty stub
      function that does nothing but reassign one of its arguments:
      
        drivers/video/fbdev/core/fbmon.c: In function 'fb_edid_to_monspecs':
        drivers/video/fbdev/core/fbmon.c:1497:67: error: parameter 'specs' set but not used [-Werror=unused-but-set-parameter]
      
      We can simply make that function completely empty to avoid the warning.
      
      This prevents a warning which everyone will see after "CFLAGS: add
      -Wunused-but-set-parameter" is merged.
      
      Link: http://lkml.kernel.org/r/20160715203229.1771162-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Stephen Boyd's avatar
      dma-debug: track bucket lock state for static checkers · d5dfc80f
      Stephen Boyd authored
      get_hash_bucket() and put_hash_bucket() acquire and release the same
      spinlock, but this confuses static checkers such as sparse:
      
        lib/dma-debug.c:254:27: warning: context imbalance in 'get_hash_bucket' - wrong count at exit
        lib/dma-debug.c:268:13: warning: context imbalance in 'put_hash_bucket' - unexpected unlock
      
      Add the appropriate acquire and release statements so that checkers can
      properly track the lock state.
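
      As a rough illustration of what such annotations look like, the sketch
      below assumes the kernel's __acquires()/__releases() macros, which expand
      to sparse context attributes under __CHECKER__ and to nothing otherwise.
      The bucket type, table and function names are hypothetical, and the bool
      flag is a single-threaded stand-in for a real spinlock; this is not the
      code from lib/dma-debug.c.

```c
#include <stdbool.h>

/* Mirror of the kernel-style macros: only sparse sees the attributes. */
#ifdef __CHECKER__
# define __acquires(x)	__attribute__((context(x, 0, 1)))
# define __releases(x)	__attribute__((context(x, 1, 0)))
#else
# define __acquires(x)
# define __releases(x)
#endif

#define NBUCKETS 4

struct bucket {
	bool locked;	/* toy stand-in for spinlock_t */
	int entries;
};

static struct bucket table[NBUCKETS];

/* Returns with the bucket "locked"; the annotation tells the checker so. */
struct bucket *get_bucket(unsigned int key)
	__acquires(&table[key % NBUCKETS].locked)
{
	struct bucket *b = &table[key % NBUCKETS];

	b->locked = true;
	return b;
}

/* Releases a lock taken elsewhere; annotated as an expected unlock. */
void put_bucket(struct bucket *b)
	__releases(&b->locked)
{
	b->locked = false;
}
```

      Without the annotations, a checker sees one function that exits holding
      a lock and another that unlocks without having locked, which matches the
      two warnings quoted above.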
      
      Link: http://lkml.kernel.org/r/20160701191552.24295-1-sboyd@codeaurora.org
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Ross Zwisler's avatar
      dax: remote unused fault wrappers · 6b524995
      Ross Zwisler authored
      Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
      removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
      dax_pmd_fault() respectively, and update all callers.
      
      The dax_fault() and dax_pmd_fault() wrappers were initially intended to
      capture some filesystem independent functionality around page faults
      (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
      and ctime).
      
      However, the following commits:
      
         5726b27b ("ext2: Add locking for DAX faults")
         ea3d7209 ("ext4: fix races between page faults and hole punching")
      
      added locking to the ext2 and ext4 filesystems after these common
      operations but before __dax_fault() and __dax_pmd_fault() were called.
      This means that these wrappers are no longer used, and are unlikely to
      be used in the future.
      
      XFS has had locking analogous to what was recently added to ext2 and
      ext4 since DAX support was initially introduced by:
      
         6b698ede ("xfs: add DAX file operations support")
      
      Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Ross Zwisler's avatar
      dax: some small updates to dax.txt documentation · 221c7dc8
      Ross Zwisler authored
      These are originally from Matthew Wilcox and were part of his huge
      "mm,fs,dax: Change ->pmd_fault to ->huge_fault" patch that was part of
      PUD support.
      
      I'm breaking these small changes out as they stand on their own and add
      useful information to Documentation/filesystems/dax.txt.
      
      Link: http://lkml.kernel.org/r/20160714214049.20075-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>