Commit c48c43e6 authored by Andrew Morton, committed by Linus Torvalds

[PATCH] minimal rmap

This is the "minimal rmap" patch, writen by Rik, ported to 2.5 by Craig
Kulsea.

Basically,

before: When the page reclaim code decides that it has scanned too many
unreclaimable pages on the LRU it does a scan of process virtual
address spaces for pages to add to swapcache.  ptes pointing at the
page are unmapped as the scan proceeds.  When all ptes referring to a
page have been unmapped and it has been written to swap, the page is
reclaimable.

after: When an anonymous page is encountered on the tail of the LRU we
use the rmap to see whether it has been referenced lately.  If it has
not, add it to swapcache.  When the page is again encountered on the
LRU, if it is still unreferenced then try to unmap all ptes which refer
to it in one hit, and if it is clean (ie: on swap) then free it.

The rest of the VM - list management, the classzone concept, etc. -
remains unchanged.
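
As an illustration only (the real logic lives in the vmscan.c and rmap.c
changes, which are collapsed further down in the diff), the decision for an
anonymous page at the tail of the LRU ends up looking roughly like this,
using just the interfaces this patch introduces:

	/*
	 * Hedged sketch, not the actual reclaim code.  "page" is an anonymous
	 * page taken from the tail of the inactive list, with the page locked.
	 * Returns nonzero if the page can be freed right now.
	 */
	static int reclaim_anon_page_sketch(struct page *page)
	{
		if (page_referenced(page))
			return 0;		/* recently used: leave it alone */

		if (!PageSwapCache(page)) {
			/* First unreferenced encounter: allocate swap space and
			 * add the page to swapcache; revisit on the next pass. */
			add_to_swap(page);
			return 0;
		}

		/* Second unreferenced encounter: drop every pte in one hit. */
		if (try_to_unmap(page) != SWAP_SUCCESS)
			return 0;

		/* Clean (ie: already on swap) means freeable. */
		return !PageDirty(page) && !PageWriteback(page);
	}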

There are a number of things which the per-page pte chain could be
used for.  Bill Irwin has identified the following:


(1)  page replacement no longer goes around randomly unmapping things

(2)  referenced bits are more accurate because there aren't several ms
        or even seconds between finding the multiple ptes mapping a page

(3)  reduces page replacement from O(total virtually mapped) to O(physical)

(4)  enables defragmentation of physical memory

(5)  enables cooperative offlining of memory for friendly guest instance
        behavior in UML and/or LPAR settings

(6)  demonstrable benefit in performance of swapping, which is common in
        end-user interactive workstation workloads (I don't like the word
        "desktop").  cf. Craig Kulesa's post wrt swapping performance

(7)  evidence from 2.4-based rmap trees indicates approximate parity
        with mainline in kernel compiles with appropriate locking bits

(8)  partitioning of physical memory can reduce the complexity of page
        replacement searches by scanning only the "interesting" zones
        (implemented and merged in 2.4-based rmap)

(9)  partitioning of physical memory can increase the parallelism of page
        replacement searches by independently processing different zones
        (implemented, but not merged, in 2.4-based rmap)

(10) the reverse mappings may be used for efficiently keeping pte cache
        attributes coherent

(11) they may be used for virtual cache invalidation (with changes)

(12) the reverse mappings enable proper RSS limit enforcement
        (implemented and merged in 2.4-based rmap)



The code adds a pointer to struct page, consumes additional storage for
the pte chains and adds computational expense to the page reclaim code
(I measured it at 3% additional load during streaming I/O).  The
benefits which we get back for all this are, I must say, theoretical
and unproven.  If it has real advantages (or, indeed, disadvantages)
then why has nobody demonstrated them?
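
For the curious: the pte_chain itself is deliberately opaque outside
mm/rmap.c (whose diff is collapsed below), but conceptually it is nothing
more than a singly linked list of pte pointers hanging off page->pte_chain,
manipulated under PG_chainlock.  A rough sketch of the structure and of what
page_add_rmap() presumably does (the field names and the pte_chain_alloc()
helper are assumptions, not the literal code):

	/* Assumed layout of the per-page reverse-mapping chain node. */
	struct pte_chain {
		struct pte_chain *next;		/* next pte that maps this page */
		pte_t *ptep;			/* the pte itself */
	};

	/* Rough shape of page_add_rmap(): prepend one node under PG_chainlock.
	 * The real code also ignores ZERO_PAGE (see the do_anonymous_page hunk). */
	void page_add_rmap(struct page *page, pte_t *ptep)
	{
		struct pte_chain *pc = pte_chain_alloc();	/* hypothetical helper */

		pte_chain_lock(page);
		pc->ptep = ptep;
		pc->next = page->pte_chain;
		page->pte_chain = pc;
		pte_chain_unlock(page);
	}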



There are a number of things remaining to be done:

1: Demonstrate the above advantages.

2: Make it work with pte-highmem  (Bill Irwin is signed up for this)

3: Optimisation: don't add pte_chains to non-shared pages (Dave McCracken's
   patch does this)

4: Move the pte_chains into highmem too (Bill, I guess)

5: per-cpu pte_chain freelists (Rik?)

6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)

7: multithread the page reclaim code.  (I have patches).

8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
   often well-ordered-by-virtual-address on the LRU, so it "just
   works" for benchmarky loads.  But there may be some other loads...

9: Fix bad IO latency in page reclaim (I have lame patches)

10: Develop tuning tools, use them.

11: The nightly updatedb run is still evicting everything.
parent b15d45bf
......@@ -36,6 +36,7 @@
#include <linux/spinlock.h>
#include <linux/personality.h>
#include <linux/binfmts.h>
#include <linux/swap.h>
#define __NO_VERSION__
#include <linux/module.h>
#include <linux/namei.h>
......@@ -283,6 +284,7 @@ void put_dirty_page(struct task_struct * tsk, struct page *page, unsigned long a
flush_dcache_page(page);
flush_page_to_ram(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, PAGE_COPY))));
page_add_rmap(page, pte);
pte_unmap(pte);
tsk->mm->rss++;
spin_unlock(&tsk->mm->page_table_lock);
......
#ifndef _ALPHA_RMAP_H
#define _ALPHA_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _ARMV_RMAP_H
#define _ARMV_RMAP_H
/*
* linux/include/asm-arm/proc-armv/rmap.h
*
* Architecture dependent parts of the reverse mapping code,
*
* ARM is different since hardware page tables are smaller than
* the page size and Linux uses a "duplicate" one with extra info.
* For rmap this means that the first 2 kB of a page are the hardware
* page tables and the last 2 kB are the software page tables.
*/
static inline void pgtable_add_rmap(pte_t * ptep, struct mm_struct * mm, unsigned long address)
{
struct page * page = virt_to_page(ptep);
page->mm = mm;
page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
}
static inline void pgtable_remove_rmap(pte_t * ptep)
{
struct page * page = virt_to_page(ptep);
page->mm = NULL;
page->index = 0;
}
static inline struct mm_struct * ptep_to_mm(pte_t * ptep)
{
struct page * page = virt_to_page(ptep);
return page->mm;
}
/* The page table takes half of the page */
#define PTE_MASK ((PAGE_SIZE / 2) - 1)
static inline unsigned long ptep_to_address(pte_t * ptep)
{
struct page * page = virt_to_page(ptep);
unsigned long low_bits;
low_bits = ((unsigned long)ptep & PTE_MASK) * PTRS_PER_PTE;
return page->index + low_bits;
}
#endif /* _ARMV_RMAP_H */
#ifndef _ARM_RMAP_H
#define _ARM_RMAP_H
#include <asm/proc/rmap.h>
#endif /* _ARM_RMAP_H */
#ifndef _CRIS_RMAP_H
#define _CRIS_RMAP_H
/* nothing to see, move along :) */
#include <asm-generic/rmap.h>
#endif
#ifndef _GENERIC_RMAP_H
#define _GENERIC_RMAP_H
/*
* linux/include/asm-generic/rmap.h
*
* Architecture dependent parts of the reverse mapping code,
* this version should work for most architectures with a
* 'normal' page table layout.
*
* We use the struct page of the page table page to find out
* the process and full address of a page table entry:
* - page->mapping points to the process' mm_struct
* - page->index has the high bits of the address
* - the lower bits of the address are calculated from the
* offset of the page table entry within the page table page
*/
#include <linux/mm.h>
static inline void pgtable_add_rmap(struct page * page, struct mm_struct * mm, unsigned long address)
{
#ifdef BROKEN_PPC_PTE_ALLOC_ONE
/* OK, so PPC calls pte_alloc() before mem_map[] is setup ... ;( */
extern int mem_init_done;
if (!mem_init_done)
return;
#endif
page->mapping = (void *)mm;
page->index = address & ~((PTRS_PER_PTE * PAGE_SIZE) - 1);
}
static inline void pgtable_remove_rmap(struct page * page)
{
page->mapping = NULL;
page->index = 0;
}
static inline struct mm_struct * ptep_to_mm(pte_t * ptep)
{
struct page * page = virt_to_page(ptep);
return (struct mm_struct *) page->mapping;
}
static inline unsigned long ptep_to_address(pte_t * ptep)
{
struct page * page = virt_to_page(ptep);
unsigned long low_bits;
low_bits = ((unsigned long)ptep & ~PAGE_MASK) * PTRS_PER_PTE;
return page->index + low_bits;
}
#endif /* _GENERIC_RMAP_H */
#ifndef _I386_RMAP_H
#define _I386_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _IA64_RMAP_H
#define _IA64_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _M68K_RMAP_H
#define _M68K_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _MIPS_RMAP_H
#define _MIPS_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _MIPS64_RMAP_H
#define _MIPS64_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _PARISC_RMAP_H
#define _PARISC_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _PPC_RMAP_H
#define _PPC_RMAP_H
/* PPC calls pte_alloc() before mem_map[] is setup ... */
#define BROKEN_PPC_PTE_ALLOC_ONE
#include <asm-generic/rmap.h>
#endif
#ifndef _S390_RMAP_H
#define _S390_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _S390X_RMAP_H
#define _S390X_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _SH_RMAP_H
#define _SH_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _SPARC_RMAP_H
#define _SPARC_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
#ifndef _SPARC64_RMAP_H
#define _SPARC64_RMAP_H
/* nothing to see, move along */
#include <asm-generic/rmap.h>
#endif
......@@ -130,6 +130,9 @@ struct vm_operations_struct {
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
};
/* forward declaration; pte_chain is meant to be internal to rmap.c */
struct pte_chain;
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
......@@ -154,6 +157,8 @@ struct page {
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
struct pte_chain * pte_chain; /* Reverse pte mapping pointer.
* protected by PG_chainlock */
unsigned long private; /* mapping-private opaque data */
/*
......
......@@ -47,7 +47,7 @@
* locked- and dirty-page accounting. The top eight bits of page->flags are
* used for page->zone, so putting flag bits there doesn't work.
*/
#define PG_locked 0 /* Page is locked. Don't touch. */
#define PG_error 1
#define PG_referenced 2
#define PG_uptodate 3
......@@ -65,6 +65,7 @@
#define PG_private 12 /* Has something at ->private */
#define PG_writeback 13 /* Page is under writeback */
#define PG_nosave 15 /* Used for system suspend/resume */
#define PG_chainlock 16 /* lock bit for ->pte_chain */
/*
* Global page accounting. One instance per CPU.
......@@ -216,6 +217,31 @@ extern void get_page_state(struct page_state *ret);
#define ClearPageNosave(page) clear_bit(PG_nosave, &(page)->flags)
#define TestClearPageNosave(page) test_and_clear_bit(PG_nosave, &(page)->flags)
/*
* inlines for acquisition and release of PG_chainlock
*/
static inline void pte_chain_lock(struct page *page)
{
/*
* Assuming the lock is uncontended, this never enters
* the body of the outer loop. If it is contended, then
* within the inner loop a non-atomic test is used to
* busywait with less bus contention for a good time to
* attempt to acquire the lock bit.
*/
preempt_disable();
while (test_and_set_bit(PG_chainlock, &page->flags)) {
while (test_bit(PG_chainlock, &page->flags))
cpu_relax();
}
}
static inline void pte_chain_unlock(struct page *page)
{
clear_bit(PG_chainlock, &page->flags);
preempt_enable();
}
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
......
......@@ -142,6 +142,19 @@ struct sysinfo;
struct address_space;
struct zone_t;
/* linux/mm/rmap.c */
extern int FASTCALL(page_referenced(struct page *));
extern void FASTCALL(page_add_rmap(struct page *, pte_t *));
extern void FASTCALL(page_remove_rmap(struct page *, pte_t *));
extern int FASTCALL(try_to_unmap(struct page *));
extern int FASTCALL(page_over_rsslimit(struct page *));
/* return values of try_to_unmap */
#define SWAP_SUCCESS 0
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
#define SWAP_ERROR 3
/* linux/mm/swap.c */
extern void FASTCALL(lru_cache_add(struct page *));
extern void FASTCALL(__lru_cache_del(struct page *));
......@@ -168,6 +181,7 @@ int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);
extern void show_swap_cache_info(void);
#endif
extern int add_to_swap_cache(struct page *, swp_entry_t);
extern int add_to_swap(struct page *);
extern void __delete_from_swap_cache(struct page *page);
extern void delete_from_swap_cache(struct page *page);
extern int move_to_swap_cache(struct page *page, swp_entry_t entry);
......
......@@ -189,7 +189,6 @@ static inline int dup_mmap(struct mm_struct * mm)
mm->map_count = 0;
mm->rss = 0;
mm->cpu_vm_mask = 0;
mm->swap_address = 0;
pprev = &mm->mmap;
/*
......@@ -308,9 +307,6 @@ inline void __mmdrop(struct mm_struct *mm)
void mmput(struct mm_struct *mm)
{
if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) {
extern struct mm_struct *swap_mm;
if (swap_mm == mm)
swap_mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
list_del(&mm->mmlist);
mmlist_nr--;
spin_unlock(&mmlist_lock);
......
......@@ -16,6 +16,6 @@ obj-y := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
vmalloc.o slab.o bootmem.o swap.o vmscan.o page_io.o \
page_alloc.o swap_state.o swapfile.o numa.o oom_kill.o \
shmem.o highmem.o mempool.o msync.o mincore.o readahead.o \
pdflush.o page-writeback.o
pdflush.o page-writeback.o rmap.o
include $(TOPDIR)/Rules.make
......@@ -176,14 +176,13 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
*/
static void truncate_complete_page(struct page *page)
{
/* Leave it on the LRU if it gets converted into anonymous buffers */
if (!PagePrivate(page) || do_invalidatepage(page, 0)) {
lru_cache_del(page);
} else {
/* Drop fs-specific data so the page might become freeable. */
if (PagePrivate(page) && !do_invalidatepage(page, 0)) {
if (current->flags & PF_INVALIDATE)
printk("%s: buffer heads were leaked\n",
current->comm);
}
ClearPageDirty(page);
ClearPageUptodate(page);
remove_inode_page(page);
......@@ -660,7 +659,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
* But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
*
* The first mb is necessary to safely close the critical section opened by the
* TryLockPage(), the second mb is necessary to enforce ordering between
* TestSetPageLocked(), the second mb is necessary to enforce ordering between
* the clear_bit and the read of the waitqueue (to avoid SMP races with a
* parallel wait_on_page_locked()).
*/
......
......@@ -46,6 +46,7 @@
#include <linux/pagemap.h>
#include <asm/pgalloc.h>
#include <asm/rmap.h>
#include <asm/uaccess.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
......@@ -79,7 +80,7 @@ struct page *mem_map;
*/
static inline void free_one_pmd(mmu_gather_t *tlb, pmd_t * dir)
{
struct page *pte;
struct page *page;
if (pmd_none(*dir))
return;
......@@ -88,9 +89,10 @@ static inline void free_one_pmd(mmu_gather_t *tlb, pmd_t * dir)
pmd_clear(dir);
return;
}
pte = pmd_page(*dir);
page = pmd_page(*dir);
pmd_clear(dir);
pte_free_tlb(tlb, pte);
pgtable_remove_rmap(page);
pte_free_tlb(tlb, page);
}
static inline void free_one_pgd(mmu_gather_t *tlb, pgd_t * dir)
......@@ -150,6 +152,7 @@ pte_t * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
pte_free(new);
goto out;
}
pgtable_add_rmap(new, mm, address);
pmd_populate(mm, pmd, new);
}
out:
......@@ -177,6 +180,7 @@ pte_t * pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address
pte_free_kernel(new);
goto out;
}
pgtable_add_rmap(virt_to_page(new), mm, address);
pmd_populate_kernel(mm, pmd, new);
}
out:
......@@ -260,10 +264,13 @@ skip_copy_pte_range: address = (address + PMD_SIZE) & PMD_MASK;
if (pte_none(pte))
goto cont_copy_pte_range_noset;
/* pte contains position in swap, so copy. */
if (!pte_present(pte)) {
swap_duplicate(pte_to_swp_entry(pte));
goto cont_copy_pte_range;
set_pte(dst_pte, pte);
goto cont_copy_pte_range_noset;
}
ptepage = pte_page(pte);
pfn = pte_pfn(pte);
if (!pfn_valid(pfn))
goto cont_copy_pte_range;
......@@ -272,7 +279,7 @@ skip_copy_pte_range: address = (address + PMD_SIZE) & PMD_MASK;
goto cont_copy_pte_range;
/* If it's a COW mapping, write protect it both in the parent and the child */
if (cow && pte_write(pte)) {
if (cow) {
ptep_set_wrprotect(src_pte);
pte = *src_pte;
}
......@@ -285,6 +292,7 @@ skip_copy_pte_range: address = (address + PMD_SIZE) & PMD_MASK;
dst->rss++;
cont_copy_pte_range: set_pte(dst_pte, pte);
page_add_rmap(ptepage, dst_pte);
cont_copy_pte_range_noset: address += PAGE_SIZE;
if (address >= end) {
pte_unmap_nested(src_pte);
......@@ -342,6 +350,7 @@ static void zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, unsigned long address,
if (pte_dirty(pte))
set_page_dirty(page);
tlb->freed++;
page_remove_rmap(page, ptep);
tlb_remove_page(tlb, page);
}
}
......@@ -992,7 +1001,9 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
if (pte_same(*page_table, pte)) {
if (PageReserved(old_page))
++mm->rss;
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
page_add_rmap(new_page, page_table);
lru_cache_add(new_page);
/* Free the old page.. */
......@@ -1199,6 +1210,7 @@ static int do_swap_page(struct mm_struct * mm,
flush_page_to_ram(page);
flush_icache_page(vma, page);
set_pte(page_table, pte);
page_add_rmap(page, page_table);
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
......@@ -1215,14 +1227,13 @@ static int do_swap_page(struct mm_struct * mm,
static int do_anonymous_page(struct mm_struct * mm, struct vm_area_struct * vma, pte_t *page_table, pmd_t *pmd, int write_access, unsigned long addr)
{
pte_t entry;
struct page * page = ZERO_PAGE(addr);
/* Read-only mapping of ZERO_PAGE. */
entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
/* ..except if it's a write access */
if (write_access) {
struct page *page;
/* Allocate our own private page. */
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
......@@ -1248,6 +1259,7 @@ static int do_anonymous_page(struct mm_struct * mm, struct vm_area_struct * vma,
}
set_pte(page_table, entry);
page_add_rmap(page, page_table); /* ignores ZERO_PAGE */
pte_unmap(page_table);
/* No need to invalidate - it was non-present before */
......@@ -1327,6 +1339,7 @@ static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
if (write_access)
entry = pte_mkwrite(pte_mkdirty(entry));
set_pte(page_table, entry);
page_add_rmap(new_page, page_table);
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
......
......@@ -68,8 +68,14 @@ static inline int copy_one_pte(struct mm_struct *mm, pte_t * src, pte_t * dst)
{
int error = 0;
pte_t pte;
struct page * page = NULL;
if (pte_present(*src))
page = pte_page(*src);
if (!pte_none(*src)) {
if (page)
page_remove_rmap(page, src);
pte = ptep_get_and_clear(src);
if (!dst) {
/* No dest? We must put it back. */
......@@ -77,6 +83,8 @@ static inline int copy_one_pte(struct mm_struct *mm, pte_t * src, pte_t * dst)
error++;
}
set_pte(dst, pte);
if (page)
page_add_rmap(page, dst);
}
return error;
}
......
......@@ -92,6 +92,7 @@ static void __free_pages_ok (struct page *page, unsigned int order)
BUG_ON(PageLRU(page));
BUG_ON(PageActive(page));
BUG_ON(PageWriteback(page));
BUG_ON(page->pte_chain != NULL);
if (PageDirty(page))
ClearPageDirty(page);
BUG_ON(page_count(page) != 0);
......
This diff is collapsed.
......@@ -105,6 +105,69 @@ void __delete_from_swap_cache(struct page *page)
INC_CACHE_INFO(del_total);
}
/**
* add_to_swap - allocate swap space for a page
* @page: page we want to move to swap
*
* Allocate swap space for the page and add the page to the
* swap cache. Caller needs to hold the page lock.
*/
int add_to_swap(struct page * page)
{
swp_entry_t entry;
int flags;
if (!PageLocked(page))
BUG();
for (;;) {
entry = get_swap_page();
if (!entry.val)
return 0;
/* Radix-tree node allocations are performing
* GFP_ATOMIC allocations under PF_MEMALLOC.
* They can completely exhaust the page allocator.
*
* So PF_MEMALLOC is dropped here. This causes the slab
* allocations to fail earlier, so radix-tree nodes will
* then be allocated from the mempool reserves.
*
* We're still using __GFP_HIGH for radix-tree node
* allocations, so some of the emergency pools are available,
* just not all of them.
*/
flags = current->flags;
current->flags &= ~PF_MEMALLOC;
current->flags |= PF_NOWARN;
ClearPageUptodate(page); /* why? */
/*
* Add it to the swap cache and mark it dirty
* (adding to the page cache will clear the dirty
* and uptodate bits, so we need to do it again)
*/
switch (add_to_swap_cache(page, entry)) {
case 0: /* Success */
current->flags = flags;
SetPageUptodate(page);
set_page_dirty(page);
swap_free(entry);
return 1;
case -ENOMEM: /* radix-tree allocation */
current->flags = flags;
swap_free(entry);
return 0;
default: /* ENOENT: raced */
break;
}
/* Raced with "speculative" read_swap_cache_async */
current->flags = flags;
swap_free(entry);
}
}
/*
* This must be called only on pages that have
* been verified to be in the swap cache and locked.
......
......@@ -383,6 +383,7 @@ static inline void unuse_pte(struct vm_area_struct * vma, unsigned long address,
return;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
page_add_rmap(page, dir);
swap_free(entry);
++vma->vm_mm->rss;
}
......
This diff is collapsed.