1. 26 Apr, 2024 25 commits
    • mm/mempolicy: use numa_node_id() instead of cpu_to_node() · f8fd525b
      Donet Tom authored
      Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
      policy:, v4.
      
      This patchset is to optimize the cross-socket memory access with
      MPOL_PREFERRED_MANY policy.
      
      To test this patch we ran the following test on a 3 node system.
       Node 0 - 2GB   - Tier 1
       Node 1 - 11GB  - Tier 1
       Node 6 - 10GB  - Tier 2
      
       The following changes were made to memcached to set the memory policy;
       it selects Node 0 and Node 1 as the preferred nodes.
      
          #include <numaif.h>
          #include <numa.h>
          #include <errno.h>      /* errno, EINVAL */
          #include <stdio.h>      /* printf(), perror() */
          #include <stdlib.h>     /* exit() */

           unsigned long nodemask;
           int ret;

           nodemask = 0x03;        /* prefer Node 0 and Node 1 */
           ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
                               &nodemask, 10);
           /* If MPOL_F_NUMA_BALANCING isn't supported,
            * fall back to MPOL_PREFERRED_MANY */
           if (ret < 0 && errno == EINVAL) {
               printf("set mem policy normal\n");
               ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
           }
           if (ret < 0) {
               perror("Failed to call set_mempolicy");
               exit(-1);
           }
      
      Test Procedure:
      ===============
       1. Make sure memory tiering and demotion are enabled.
      2. Start memcached.
      
         # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
             -d -s "/tmp/memcached.sock"
      
      3. Run memtier_benchmark to store 3200000 keys.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
          --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
          --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024
      
      4. Start a memory eater on node 0 and 1. This will demote all memcached
         pages to node 6.
       5. Make sure all the memcached pages got demoted to the lower tier by
          reading /proc/<memcached PID>/numa_maps.
      
          # cat /proc/2771/numa_maps
           ---
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
           ---
      
      6. Kill memory eater.
      7. Read the pgpromote_success counter.
      8. Start reading the keys by running memtier_benchmark.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
         --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
         --key-minimum=1 --key-maximum=3200000 -n allkeys
         --threads=64 -c 1 -R -x 6
      
      9. Read the pgpromote_success counter.
      
      Test Results:
      =============
      Without Patch
      ------------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      2. Memtier-benchmark result.
       AGGREGATED AVERAGE RESULTS (6 runs)
       ==============================================================================================================
       Type      Ops/sec    Hits/sec   Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
       --------------------------------------------------------------------------------------------------------------
       Sets        0.00        ---         ---         ---           ---          ---           ---           0.00
       Gets    305792.03  305791.93       0.10        0.18949       0.16700      0.44700       1.71100      11542.69
       Waits       0.00        ---         ---         ---           ---          ---           ---            ---
       Totals  305792.03  305791.93       0.10        0.18949       0.16700      0.44700       1.71100      11542.69
      
      With Patch
      ---------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 5
      Node 1:  pgpromote_success 89386
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 57895
      Node 1:  pgpromote_success 141463
      
      2. Memtier-benchmark result.
       AGGREGATED AVERAGE RESULTS (6 runs)
       ==============================================================================================================
       Type      Ops/sec    Hits/sec   Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
       --------------------------------------------------------------------------------------------------------------
       Sets        0.00        ---         ---         ---           ---          ---           ---           0.00
       Gets    521942.24  521942.07       0.17        0.11459       0.10300      0.23100       0.31900      19701.68
       Waits       0.00        ---         ---         ---           ---          ---           ---            ---
       Totals  521942.24  521942.07       0.17        0.11459       0.10300      0.23100       0.31900      19701.68
      
      
      Test Result Analysis:
      =====================
       1. With the patch we can observe pages getting promoted.
       2. The memtier_benchmark results show that, with the patch,
          performance increased by more than 50%.
      
       Ops/sec without fix -  305792.03
       Ops/sec with fix    -  521942.24
      
      
      This patch (of 2):
      
       Instead of using 'cpu_to_node()', use 'numa_node_id()', which is
       quicker.  smp_processor_id() is guaranteed to be stable in
       'mpol_misplaced()' because it is called with the PTL held.  A
       lockdep_assert_held() was added to ensure that.
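       As a rough illustration of the change (an editorial sketch, not the
       exact mpol_misplaced() diff; the helper name and surrounding policy
       logic are made up for the example):

           /* Sketch: with the page table lock held we cannot migrate to
            * another CPU, so the cached per-CPU NUMA node is stable. */
           static int current_node_id_locked(spinlock_t *ptl)
           {
                   lockdep_assert_held(ptl);

                   /* was: cpu_to_node(smp_processor_id());
                    * numa_node_id() reads the per-CPU numa_node directly
                    * and is cheaper. */
                   return numa_node_id();
           }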
      
      No functional change in this patch.
      
      [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
        Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
       Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
       Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
       Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8fd525b
    • mm: zswap: remove unnecessary check in zswap_find_zpool() · fea68a75
      Yosry Ahmed authored
      zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true. 
      This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a
      config option and never made it upstream.  Remove the unnecessary check.
      
       Link: https://lkml.kernel.org/r/20240311235210.2937484-1-yosryahmed@google.com
       Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
       Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
       Reviewed-by: Nhat Pham <nphamcs@gmail.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fea68a75
    • lib/test_hmm.c: handle src_pfns and dst_pfns allocation failure · c2af060d
      Duoming Zhou authored
      The kcalloc() in dmirror_device_evict_chunk() will return null if the
      physical memory has run out.  As a result, if src_pfns or dst_pfns is
      dereferenced, the null pointer dereference bug will happen.
      
       Moreover, the device is going away: if the kcalloc() fails, the pages
       mapping a chunk cannot be evicted.  So add the __GFP_NOFAIL flag to the
       allocation.

       Finally, as there is no need for physically contiguous memory, switch
       from kcalloc() to kvcalloc() in order to avoid failing allocations.
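       Roughly, the allocations then look like this (a sketch based on the
       description above, not a verbatim quote of the patch):

           /* kvcalloc() may fall back to vmalloc(), so physically contiguous
            * memory is not required; __GFP_NOFAIL ensures the eviction path
            * never sees a NULL pointer while the device is going away. */
           src_pfns = kvcalloc(npages, sizeof(*src_pfns),
                               GFP_KERNEL | __GFP_NOFAIL);
           dst_pfns = kvcalloc(npages, sizeof(*dst_pfns),
                               GFP_KERNEL | __GFP_NOFAIL);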
      
      Link: https://lkml.kernel.org/r/20240312005905.9939-1-duoming@zju.edu.cn
      Fixes: b2ef9f5a ("mm/hmm/test: add selftest driver for HMM")
       Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Cc: Jérôme Glisse <jglisse@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2af060d
    • mm: zpool: return pool size in pages · 4196b48d
      Johannes Weiner authored
      All zswap backends track their pool sizes in pages.  Currently they
      multiply by PAGE_SIZE for zswap, only for zswap to divide again in order
      to do limit math.  Report pages directly.
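       The contract change is along these lines (a sketch of the idea only;
       the pages-instead-of-bytes interface is the point, exact helper names
       may differ from the merged code):

           /* The zpool backend reports its size in pages ... */
           u64 nr_pages = zpool_get_total_pages(pool);

           /* ... so a caller that really wants bytes converts once, instead
            * of every backend multiplying by PAGE_SIZE only for zswap to
            * divide it right back out for its limit math. */
           u64 nr_bytes = nr_pages << PAGE_SHIFT;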
      
       Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org
       Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Yosry Ahmed <yosryahmed@google.com>
       Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
       Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4196b48d
    • mm: zswap: optimize zswap pool size tracking · 91cdcd8d
      Johannes Weiner authored
      Profiling the munmap() of a zswapped memory region shows 60% of the total
      cycles currently going into updating the zswap_pool_total_size.
      
      There are three consumers of this counter:
      - store, to enforce the globally configured pool limit
      - meminfo & debugfs, to report the size to the user
      - shrink, to determine the batch size for each cycle
      
       Instead of aggregating every time an entry enters or exits the zswap
       pool, aggregate the value from the zpools on demand:
      
      - Stores aggregate the counter anyway upon success. Aggregating to
        check the limit instead is the same amount of work.
      
      - Meminfo & debugfs might benefit somewhat from a pre-aggregated
        counter, but aren't exactly hotpaths.
      
      - Shrinking can aggregate once for every cycle instead of doing it for
        every freed entry. As the shrinker might work on tens or hundreds of
        objects per scan cycle, this is a large reduction in aggregations.
      
      The paths that benefit dramatically are swapin, swapoff, and unmaps. 
      There could be millions of pages being processed until somebody asks for
      the pool size again.  This eliminates the pool size updates from those
      paths entirely.
      
      Top profile entries for a 24G range munmap(), before:
      
          38.54%  zswap-unmap  [kernel.kallsyms]  [k] zs_zpool_total_size
          12.51%  zswap-unmap  [kernel.kallsyms]  [k] zpool_get_total_size
           9.10%  zswap-unmap  [kernel.kallsyms]  [k] zswap_update_total_size
           2.95%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           2.88%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           2.86%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      and after:
      
           7.70%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           7.16%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           6.74%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      It was also briefly considered to move to a single atomic in zswap
      that is updated by the backends, since zswap only cares about the sum
      of all pools anyway. However, zram directly needs per-pool information
      out of zsmalloc. To keep the backend from having to update two atomics
      every time, I opted for the lazy aggregation instead for now.
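       The on-demand aggregation amounts to something like this (a simplified
       sketch consistent with the description above, not the verbatim patch;
       list locking details are elided):

           static unsigned long zswap_total_pages(void)
           {
                   struct zswap_pool *pool;
                   unsigned long total = 0;

                   rcu_read_lock();
                   list_for_each_entry_rcu(pool, &zswap_pools, list) {
                           int i;

                           for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
                                   total += zpool_get_total_pages(pool->zpools[i]);
                   }
                   rcu_read_unlock();

                   return total;
           }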
      
       Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
       Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Yosry Ahmed <yosryahmed@google.com>
       Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
       Reviewed-by: Nhat Pham <nphamcs@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91cdcd8d
    • mm: document pXd_leaf() API · 64078b3d
      Peter Xu authored
       There's one small section already, but since we're going to remove
       pXd_huge(), that comment may start to become obsolete.

       Rewrite that section with more information, so that hopefully the API
       is crystal clear on what it implies.
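       As a rough illustration of the contract the rewritten comment pins
       down (an editorial sketch, not the new documentation text; the helper
       name is made up for the example):

           /* A leaf entry is a *present* entry that maps a huge physical
            * range directly at this level: it is neither a pointer to a
            * lower level page table nor a swap/migration entry. */
           static bool pmd_is_huge_mapping(pmd_t *pmdp)
           {
                   pmd_t pmd = pmdp_get(pmdp);

                   return pmd_leaf(pmd);
           }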
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-15-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      64078b3d
    • mm/arm: remove pmd_thp_or_huge() · 502016e3
      Peter Xu authored
      ARM/ARM64 used to define pmd_thp_or_huge().  Now this macro is completely
      redundant.  Remove it and use pmd_leaf().
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      502016e3
    • mm/treewide: remove pXd_huge() · 9636f055
      Peter Xu authored
      This API is not used anymore, drop it for the whole tree.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9636f055
    • mm/treewide: replace pXd_huge() with pXd_leaf() · 1965e933
      Peter Xu authored
      Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
      reuse it.  Luckily, pXd_huge() isn't widely used.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1965e933
    • mm/gup: merge pXd huge mapping checks · 7db86dc3
      Peter Xu authored
      Huge mapping checks in GUP are slightly redundant and can be simplified.
      
      pXd_huge() now is the same as pXd_leaf().  pmd_trans_huge() and
      pXd_devmap() should both imply pXd_leaf(). Time to merge them into one.
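       Conceptually, the per-level checks collapse into a single predicate
       (a sketch of the idea, not the literal GUP diff; the helper name is
       made up for the example):

           /* pmd_trans_huge() and pmd_devmap() both imply pmd_leaf(), and
            * pmd_huge() is now identical to pmd_leaf(), so one test covers
            * every kind of huge PMD mapping GUP can meet here. */
           static inline bool gup_pmd_is_leaf(pmd_t pmd)
           {
                   return pmd_leaf(pmd);
           }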
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-11-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7db86dc3
    • mm/powerpc: redefine pXd_huge() with pXd_leaf() · 460b9adc
      Peter Xu authored
      PowerPC book3s 4K mostly has the same definition on both, except
      pXd_huge() constantly returns 0 for hash MMUs.  As Michael Ellerman
      pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit
      will never be set so it will keep returning false.
      
      As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
      such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
      pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf
      hugetlb mappings.
      
      The goal should be that we will have one API pXd_leaf() to detect all
      kinds of huge mappings (hugepd is still special in this case, though). 
       AFAICT we need to use the pXd_leaf() implementation (rather than
       pXd_huge()'s) to make sure that, e.g., THPs on hash MMUs will also
       return true.
      
      This helps to simplify a follow up patch to drop pXd_huge() treewide.
      
       NOTE: the *_leaf() definitions need to be moved before the inclusion of
       asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with them.
      
      [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse
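       The redefinition is essentially the following (an editorial sketch of
       the idea, not a verbatim quote of the header change):

           /* book3s/64 4K: a huge mapping is a leaf entry, i.e. _PAGE_PTE is
            * set; on hash MMUs that bit is never set at pmd/pud level, so
            * these keep returning false there. */
           static inline int pmd_huge(pmd_t pmd)
           {
                   return pmd_leaf(pmd);
           }

           static inline int pud_huge(pud_t pud)
           {
                   return pud_leaf(pud);
           }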
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-10-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      460b9adc
    • mm/arm64: merge pXd_huge() and pXd_leaf() definitions · 961a6ee5
      Peter Xu authored
      Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
      differently.  Redefine the pXd_huge() with pXd_leaf().
      
       There used to be two traps in the old aarch64 definitions of these APIs
       that I found when reading the surrounding code; they're:
      
       (1) 4797ec2d ("arm64: fix pud_huge() for 2-level pagetables")
       (2) 23bc8f69 ("arm64: mm: fix p?d_leaf()")
      
       Defining pXd_huge() with the current pXd_leaf() makes sure (2) isn't a
       problem (on PROT_NONE checks).  To make sure it also works for (1), we
       move the __PAGETABLE_PMD_FOLDED check over to pud_leaf(), allowing it
       to constantly return "false" for 2-level pgtables, which looks even
       safer and now covers both.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      961a6ee5
    • mm/arm: redefine pmd_huge() with pmd_leaf() · 6818135d
      Peter Xu authored
       Most of the archs already define these two APIs the same way.  ARM is
       more complicated in two aspects:

         - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while
           for pXd_leaf() it's always checking against PXD_TYPE_SECT.

         - SECT/TABLE bits are defined differently on 2-level vs. 3-level ARM
           pgtables, which makes the whole thing even harder to follow.

       Luckily, the second complexity should be hidden by the pmd_leaf()
       implementation across the 2-level vs. 3-level headers.  Invoke
       pmd_leaf() directly for pmd_huge(), to remove the first part of the
       complexity.  This prepares for dropping the pXd_huge() API globally.
      
       While at it, drop the obsolete comment.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-8-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6818135d
    • mm/arm: use macros to define pmd/pud helpers · 7966a2b7
      Peter Xu authored
       It's already confusing that ARM 2-level vs. 3-level pgtables define the
       SECT bit differently on pmds/puds.  Always use a macro, which is much
       clearer.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-7-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7966a2b7
    • mm/sparc: change pXd_huge() behavior to exclude swap entries · ae798490
      Peter Xu authored
      Please refer to the previous patch on the reasoning for x86.  Now sparc is
      the only architecture that will allow swap entries to be reported as
      pXd_huge().  After this patch, all architectures should forbid swap
      entries in pXd_huge().
      
      [akpm@linux-foundation.org: s/;;/;/, per Muchun]
       Link: https://lkml.kernel.org/r/20240318200404.448346-6-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ae798490
    • mm/x86: change pXd_huge() behavior to exclude swap entries · d0973cb9
      Peter Xu authored
      This patch partly reverts below commits:
      
      3a194f3f ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
      cbef8478 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")
      
       Right now, the pXd_huge() definition across the kernel is unclear.  We
       have two groups that think differently about swap entries:

         - x86/sparc:     allow pXd_huge() to accept swap entries
         - all the rest:  don't allow pXd_huge() to accept swap entries

       This is confusing.  The sparc helpers seem to have been added in 2016,
       after x86's (2015), so sparc may simply have followed the trend.  x86
       proposed such swap handling in 2015 to resolve hugetlb swap entries hit
       in GUP, but now GUP guards swap entries with !pXd_present() in all
       layers, so we should be safe.
      
      We should define this API properly, one way or another, rather than keep
      them defined differently across archs.
      
       Gut feeling tells me that pXd_huge() shouldn't include swap entries, and
       it turns out that I am not the only one thinking so; the question was
       raised when the current pmd_huge() for x86 was proposed by Ville Syrjälä:
      
      https://lore.kernel.org/all/Y2WQ7I4LXh8iUIRd@intel.com/
      
        I might also be missing something obvious, but why is it even necessary
        to treat PRESENT==0+PSE==0 as a huge entry?
      
       It was also questioned when Jason Gunthorpe reviewed the other patchset
       on swap entry handling:
      
      https://lore.kernel.org/all/20240221125753.GQ13330@nvidia.com/
      
       Revert its meaning back to the original.  It shouldn't cause any
       functional change, as we should now have explicit !pXd_present() guards
       everywhere.
      
       Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2"; it was
       probably there because things were breaking when 3a194f3f was proposed,
       according to the report here:
      
      https://lore.kernel.org/all/Y2LYXItKQyaJTv8j@intel.com/
      
      Now we shouldn't need that.
      
      Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf().
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-5-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0973cb9
    • mm/gup: check p4d presence before going on · 089f9214
      Peter Xu authored
       Currently there should be no p4d swap entries, so it may not matter
       much; however, this may help us rule out swap entries in the pXd_huge()
       API, which includes p4d_huge().  The p4d_present() checks make it 100%
       clear that we won't rely on p4d_huge() for swap entries.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-4-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      089f9214
    • mm/gup: cache p4d in follow_p4d_mask() · e6fd5564
      Peter Xu authored
       Add a variable to cache the p4d entry in follow_p4d_mask().  It's good
       practice to make sure all the following checks operate on a consistent
       view of the entry.
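       The pattern is simply (a sketch along the lines of the description;
       the rest of follow_p4d_mask() is elided):

           p4d_t *p4dp, p4d;

           p4dp = p4d_offset(pgdp, address);
           p4d = READ_ONCE(*p4dp);   /* snapshot the entry once */

           /* Every later test (p4d_none(), p4d_present(), leaf checks, ...)
            * is made against this one snapshot, so the checks cannot
            * disagree with each other if the entry changes concurrently. */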
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-3-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6fd5564
    • mm/hmm: process pud swap entry without pud_huge() · 9abc71b4
      Peter Xu authored
      Swap pud entries do not always return true for pud_huge() for all archs. 
      x86 and sparc (so far) allow it, but all the rest do not accept a swap
      entry to be reported as pud_huge().  So it's not safe to check swap
      entries within pud_huge().  Check swap entries before pud_huge(), so it
      should be always safe.
      
      This is the only place in the kernel that (IMHO, wrongly) relies on
      pud_huge() to return true on pud swap entries.  The plan is to cleanup
      pXd_huge() to only report non-swap mappings for all archs.
      
       Link: https://lkml.kernel.org/r/20240318200404.448346-2-peterx@redhat.com
       Signed-off-by: Peter Xu <peterx@redhat.com>
       Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9abc71b4
    • mm: page_alloc: control latency caused by zone PCP draining · 55f77df7
      Lucas Stach authored
      Patch series "mm/treewide: Remove pXd_huge() API", v2.
      
      In previous work [1], we removed the pXd_large() API, which is arch
      specific.  This patchset further removes the hugetlb pXd_huge() API.
      
       Hugetlb was never special in creating huge mappings when compared with
       other huge mappings.  Having a standalone API just to detect such
       pgtable entries is more or less redundant, especially after the
       pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE.

       While looking at this problem, a few issues were also exposed in that
       we don't have a clear definition of the *_huge() variant APIs.  This
       patchset starts by cleaning up these issues first, then replaces all
       *_huge() users with *_leaf(), then drops all *_huge() code.

       On x86/sparc, swap entries are reported as "true" by pXd_huge(), while
       on all the other archs they're reported as "false" instead.  This part
       is done in patches 1-5; I suspect patch 1 can be seen as a bug fix, but
       I'll leave that to the hmm experts to decide.

       Besides, three archs (arm, arm64, powerpc) have slightly different
       definitions for the *_huge() vs. *_leaf() variants.  I tackled them
       separately so that it'll be easier for arch experts to chime in when
       necessary.  This part is done in patches 6-9.

       The final patches 10-14 do the rest of the removal.  Since *_leaf()
       will be the ultimate API going forward, and we seem to have quite some
       confusion about how the *_huge() APIs should be defined, provide a rich
       comment for the *_leaf() API set to define it properly and avoid future
       misuse; hopefully that will also help new archs start supporting huge
       mappings while avoiding the traps (like swap entries or PROT_NONE entry
       checks).
      
      [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com
      
      
      This patch (of 14):
      
      When the complete PCP is drained a much larger number of pages than the
      usual batch size might be freed at once, causing large IRQ and preemption
      latency spikes, as they are all freed while holding the pcp and zone
      spinlocks.
      
      To avoid those latency spikes, limit the number of pages freed in a single
      bulk operation to common batch limits.
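       The fix boils down to draining in bounded chunks instead of all at once
       (a simplified sketch of the idea, not the exact function; 'batch'
       stands in for the per-CPU batch limit):

           do {
                   spin_lock(&pcp->lock);
                   count = pcp->count;
                   if (count) {
                           int to_drain = min(count, batch);

                           /* free at most one batch per lock hold, so IRQ and
                            * preemption latency stays bounded */
                           free_pcppages_bulk(zone, to_drain, pcp, 0);
                           count -= to_drain;
                   }
                   spin_unlock(&pcp->lock);
           } while (count);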
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com
       Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de
       Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55f77df7
    • selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg · 13e86096
      Dev Jain authored
      mmap() must not succeed in validate_lower_address_hint(), for if it does,
      it is a bug in mmap() itself.  Reflect this behaviour with
      ksft_exit_fail_msg().  While at it, do some formatting changes.
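       The resulting check is along these lines (a sketch only; variable names
       and the message text follow the test loosely, not verbatim):

           /* mmap() succeeding below the hint would be a bug in mmap()
            * itself, not an environment problem, so fail the test hard. */
           ptr = mmap(hint, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (ptr != MAP_FAILED)
                   ksft_exit_fail_msg("mmap unexpectedly succeeded below the hint\n");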
      
       Link: https://lkml.kernel.org/r/20240314122250.68534-1-dev.jain@arm.com
       Signed-off-by: Dev Jain <dev.jain@arm.com>
       Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      13e86096
    • mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) · fa9fcd8b
      David Hildenbrand authored
      We changed faultin_page_range() to no longer consume a VMA, because
      faultin_page_range() might internally release the mm lock to lookup
      the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
      independent of that, __get_user_pages() will always lookup the VMA
      itself.
      
      Now that we let __get_user_pages() just handle VMA checks in a way that
      is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
      is just overhead. So let's just call madvise_populate()
      on the full range instead.
      
      There is one change in behavior: madvise_walk_vmas() would skip any VMA
      holes, and if everything succeeded, it would return -ENOMEM after
      processing all VMAs.
      
      However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
      notice any difference: -ENOMEM might either indicate that there were VMA
      holes or that populating page tables failed because there was not enough
      memory. So it's unlikely that user space will notice the difference, and
      that special handling likely only makes sense for some other madvise()
      actions.
      
      Further, we'd already fail with -ENOMEM early in the past if looking up the
      VMA after dropping the MM lock failed because of concurrent VMA
      modifications. So let's just keep it simple and avoid the madvise VMA
      walk, and consistently fail early if we find a VMA hole.
      
       Link: https://lkml.kernel.org/r/20240314161300.382526-3-david@redhat.com
       Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa9fcd8b
    • mm: memcg: add NULL check to obj_cgroup_put() · 91b71e78
      Yosry Ahmed authored
       9 out of 16 callers perform a NULL check before calling obj_cgroup_put().
       Move the NULL check into the function, similar to mem_cgroup_put().  The
       unlikely() NULL check in current_objcg_update() was left alone to avoid
       dropping the unlikely() annotation, as this is a fast path.
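       With that, the helper becomes (a sketch consistent with the description
       above, mirroring mem_cgroup_put()):

           static inline void obj_cgroup_put(struct obj_cgroup *objcg)
           {
                   if (objcg)
                           percpu_ref_put(&objcg->refcnt);
           }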
      
       Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com
       Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91b71e78
    • mm: remove guard around pgd_offset_k() macro · 5b0a6700
      Christophe Leroy authored
       The last architecture redefining pgd_offset_k() was IA64, and it was
       removed by commit cf8e8658 ("arch: Remove Itanium (IA-64)
       architecture").

       There is no longer any need to guard the generic version of
       pgd_offset_k() with #ifndef pgd_offset_k.
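       For reference, the surviving generic definition looks like this (shown
       here for context; it now stands on its own, without the #ifndef wrapper):

           #define pgd_offset_k(address)    pgd_offset(&init_mm, (address))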
      
       Link: https://lkml.kernel.org/r/59d3f47d5615d18cca1986f269be2fcb3df34556.1710589838.git.christophe.leroy@csgroup.eu
       Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
       Reviewed-by: David Hildenbrand <david@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b0a6700
    • Andrew Morton
  2. 25 Apr, 2024 11 commits
  3. 16 Apr, 2024 4 commits