Commit 2157e7b8 authored by Alexey Kardashevskiy's avatar Alexey Kardashevskiy Committed by Michael Ellerman

vfio: powerpc/spapr: Register memory and define IOMMU v2

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.
Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
parent 15b244a8
...@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note ...@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
This implementation has some specifics: This implementation has some specifics:
1) Only one IOMMU group per container is supported as an IOMMU group 1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
represents the minimal entity which isolation can be guaranteed for and container is supported as an IOMMU table is allocated at the boot time,
groups are allocated statically, one per a Partitionable Endpoint (PE) one table per a IOMMU group which is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always). (PE is often a PCI domain but not always).
Newer systems (POWER8 with IODA2) have improved hardware design which allows
to remove this limitation and have multiple IOMMU groups per a VFIO container.
2) The hardware supports so called DMA windows - the PCI address range 2) The hardware supports so called DMA windows - the PCI address range
within which DMA transfer is allowed, any attempt to access address space within which DMA transfer is allowed, any attempt to access address space
...@@ -439,6 +441,29 @@ The code flow from the example above should be slightly changed: ...@@ -439,6 +441,29 @@ The code flow from the example above should be slightly changed:
.... ....
5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
VFIO_IOMMU_DISABLE and implements 2 new ioctls:
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
(which are unsupported in v1 IOMMU).
PPC64 paravirtualized guests generate a lot of map/unmap requests,
and the handling of those includes pinning/unpinning pages and updating
mm::locked_vm counter to make sure we do not exceed the rlimit.
The v2 IOMMU splits accounting and pinning into separate operations:
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
receive a user space address and size of the block to be pinned.
Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
be called with the exact address and size used for registering
the memory block. The userspace is not expected to call these often.
The ranges are stored in a linked list in a VFIO container.
- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
IOMMU table and do not do pinning; instead these check that the userspace
address is from pre-registered range.
This separation helps in optimizing DMA for guests.
------------------------------------------------------------------------------- -------------------------------------------------------------------------------
[1] VFIO was originally an acronym for "Virtual Function I/O" in its [1] VFIO was originally an acronym for "Virtual Function I/O" in its
......
...@@ -112,9 +112,15 @@ struct iommu_table { ...@@ -112,9 +112,15 @@ struct iommu_table {
unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long *it_map; /* A simple allocation bitmap for now */
unsigned long it_page_shift;/* table iommu page size */ unsigned long it_page_shift;/* table iommu page size */
struct list_head it_group_list;/* List of iommu_table_group_link */ struct list_head it_group_list;/* List of iommu_table_group_link */
unsigned long *it_userspace; /* userspace view of the table */
struct iommu_table_ops *it_ops; struct iommu_table_ops *it_ops;
}; };
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
((tbl)->it_userspace ? \
&((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
NULL)
/* Pure 2^n version of get_order */ /* Pure 2^n version of get_order */
static inline __attribute_const__ static inline __attribute_const__
int get_iommu_order(unsigned long size, struct iommu_table *tbl) int get_iommu_order(unsigned long size, struct iommu_table *tbl)
......
This diff is collapsed.
...@@ -36,6 +36,8 @@ ...@@ -36,6 +36,8 @@
/* Two-stage IOMMU */ /* Two-stage IOMMU */
#define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */ #define VFIO_TYPE1_NESTING_IOMMU 6 /* Implies v2 */
#define VFIO_SPAPR_TCE_v2_IOMMU 7
/* /*
* The IOCTL interface is designed for extensibility by embedding the * The IOCTL interface is designed for extensibility by embedding the
* structure length (argsz) and flags into structures passed between * structure length (argsz) and flags into structures passed between
...@@ -507,6 +509,31 @@ struct vfio_eeh_pe_op { ...@@ -507,6 +509,31 @@ struct vfio_eeh_pe_op {
#define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21) #define VFIO_EEH_PE_OP _IO(VFIO_TYPE, VFIO_BASE + 21)
/**
* VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
*
* Registers user space memory where DMA is allowed. It pins
* user pages and does the locked memory accounting so
* subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
* get faster.
*/
struct vfio_iommu_spapr_register_memory {
__u32 argsz;
__u32 flags;
__u64 vaddr; /* Process virtual address */
__u64 size; /* Size of mapping (bytes) */
};
#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 17)
/**
* VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
*
* Unregisters user space memory registered with
* VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
* Uses vfio_iommu_spapr_register_memory for parameters.
*/
#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
/* ***************************************************************** */ /* ***************************************************************** */
#endif /* _UAPIVFIO_H */ #endif /* _UAPIVFIO_H */
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment